Parallelism thrills but sometimes it kills (Big Data - Spark) !!

There is a rather popular saying, “Speed thrills but it kills”: the very thing that gives you wings or makes you faster can also turn out to be catastrophic. This applies to our own Big Data world as well, especially to in-memory processing frameworks like Apache Spark. I recently faced such a confusing scenario, where following all the norms of the framework (parallelism, a distributed approach) was causing the application to slow down significantly. So I decided to share the details here, which may help you avoid such encounters.

Problem Statement: The requirement was to combine multiple dataframes, generated dynamically, into a single final result containing all the individual data points from each of the dataframes.

Code snippet (simplified version):

val spark = SparkSession.builder.appName("testSpark").master("local").ge