🔍 Top 3 Spark Performance Issues

1. ⚖️ Skew

  • Skew is an uneven distribution of data across partitions.
  • When one key or value occurs far more often than others, the partitions holding it (and the tasks processing them) end up much larger than the rest.
  • The result is straggler tasks that underuse the cluster's parallelism, and in extreme cases tasks that time out and fail.

Solution:

  • Enable adaptive query execution (AQE), which automatically detects and splits skewed partitions.
  • Use salting to distribute the data more evenly: an artificial salt column splits the frequently occurring values across multiple partitions (see the sketch after this list).
  • Filter out the skewed values and process them separately.
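A minimal PySpark sketch of the first two fixes; the tiny DataFrame, the key "customer_id", and the hot value "acme" are purely illustrative.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# 1) Let adaptive query execution detect and split skewed partitions.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")

# 2) Salting: spread one hot key over N artificial sub-keys.
df = spark.createDataFrame(
    [("acme", 10.0)] * 1000 + [("other", 5.0)] * 10,  # "acme" is the skewed key
    ["customer_id", "amount"],
)
N = 10
salted = df.withColumn("salt", (F.rand() * N).cast("int"))

# Aggregate in two steps: per (key, salt) first, then per key.
partial = salted.groupBy("customer_id", "salt").agg(F.sum("amount").alias("partial_sum"))
totals = partial.groupBy("customer_id").agg(F.sum("partial_sum").alias("total_amount"))
totals.show()
```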

2. 🔄 Shuffle

  • The most expensive operation in Spark.
  • Costly data movement across the cluster.
  • Happens when you call an operation that requires exchanging data between nodes (e.g., groupBy and join).

Solution:

  • Filter early.
  • Use a broadcast join when possible: if one of the tables is small enough, send a copy to every node so the large table never has to be shuffled (see the sketch after this list).
  • Use partitioning to reduce the amount of data that needs to be shuffled.
  • Use window functions instead of groupBy-and-join patterns when possible.
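A small sketch of filtering early and broadcasting the small side of a join; the `orders` and `countries` DataFrames and their column names are illustrative.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()

orders = spark.createDataFrame(
    [(1, "NL", "2024-03-01", 20.0), (2, "US", "2023-11-15", 35.0)],
    ["order_id", "country_code", "order_date", "amount"],
)
countries = spark.createDataFrame(
    [("NL", "Netherlands"), ("US", "United States")],
    ["country_code", "country_name"],
)

# Filter early so less data reaches the join, then broadcast the small table
# so the large one never needs to be shuffled.
recent = orders.filter(F.col("order_date") >= "2024-01-01")
joined = recent.join(broadcast(countries), on="country_code", how="left")
joined.show()
```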

3. 💾 Spill

  • Happens when the data doesn’t fit in an executor’s memory.
  • Memory limits → disk usage → slow tasks.

Solution:

  • Select only the columns you need and filter early (see the sketch after this list).
  • Reduce cardinality (the number of unique values in a column).
  • Increase memory per core.
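A rough sketch of the first two points; the input path and column names are hypothetical, so adjust them to your own data.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Read only the columns you need and filter as early as possible, so less
# data has to be held (and potentially spilled) per task.
events = (
    spark.read.parquet("/data/events")          # hypothetical input path
    .select("user_id", "event_type", "event_date")
    .filter(F.col("event_date") >= "2024-01-01")
)

# More memory per core is usually set at submit time, e.g.:
#   spark-submit --executor-memory 8g --executor-cores 2 ...
```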

⚠️ Common Pitfalls

1. 🗂️ Too many / too few partitions

  • Too many small files cause issues with file reads and metadata management, negatively impacting query performance.
  • A few large files result in poor parallelism.
  • ✅ Use auto-compaction (e.g., when using Delta tables).
  • On Databricks, you can use the OPTIMIZE command to compact small files into larger ones and improve query performance (see the sketch after this list).
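A minimal sketch of compacting a Delta table; OPTIMIZE, ZORDER, and auto-compaction are Delta Lake / Databricks features, and the table name is illustrative.

```python
# Compact small files into larger ones.
spark.sql("OPTIMIZE sales.orders")

# Optionally co-locate rows that are often filtered together.
spark.sql("OPTIMIZE sales.orders ZORDER BY (order_date)")

# Auto-compaction can also be enabled as a Delta table property.
spark.sql(
    "ALTER TABLE sales.orders "
    "SET TBLPROPERTIES ('delta.autoOptimize.autoCompact' = 'true')"
)
```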

2. Partitioning on the wrong column

  • Best to partition on columns with low to moderate cardinality (dozens to thousands of values).
  • When designing the data ingestion process, work backwards from the end goal.
  • To pick the right column for partitioning, consider the consumption patterns: the most frequently used filters, joins, and aggregations (see the sketch after this list).
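A brief sketch of partitioning on a low/moderate-cardinality column that matches the most common filter, reusing the illustrative `events` DataFrame from the spill sketch above; the output path is hypothetical.

```python
from pyspark.sql import functions as F

(
    events.write
    .partitionBy("event_date")        # dozens to thousands of distinct values
    .mode("overwrite")
    .parquet("/data/events_partitioned")
)

# Readers that filter on the partition column only touch the matching folders.
recent = (
    spark.read.parquet("/data/events_partitioned")
    .filter(F.col("event_date") == "2024-06-01")
)
```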

3. Following Spark anti-patterns

  • Iterating over rows: a no-no, because row-by-row loops can’t be parallelised.
  • Using UDFs when they aren’t needed: prefer Spark’s built-in functions when possible (see the sketch after this list).
  • Calling collect() or .toPandas() on big datasets—defeats the purpose of using Spark.
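A short sketch contrasting a Python UDF with the equivalent built-in function; the DataFrame and column names are illustrative.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("alice",), ("bob",)], ["name"])

# Avoid: a Python UDF runs row by row outside the JVM and blocks optimisation.
to_upper_udf = F.udf(lambda s: s.upper() if s is not None else None, StringType())
slow = df.withColumn("name_upper", to_upper_udf(F.col("name")))

# Prefer: the built-in function stays inside the engine and its optimiser.
fast = df.withColumn("name_upper", F.upper(F.col("name")))
fast.show()
```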

4. Not using caching when you should

  • Because of Spark’s lazy evaluation, intermediate results are not kept in memory once an action completes; the whole lineage is recomputed for the next action.
  • If you plan to reuse a DataFrame across multiple actions, cache it to avoid that expensive recomputation; otherwise it is recomputed every time (see the sketch after this list).
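A minimal sketch of caching a DataFrame that several actions reuse; the data and column names are illustrative.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("NL", "Amsterdam", 10.0), ("NL", "Utrecht", 20.0), ("US", "Boston", 30.0)],
    ["country", "city", "amount"],
)

filtered = df.filter(F.col("country") == "NL").cache()

filtered.count()                          # first action materialises the cache
filtered.groupBy("city").count().show()   # reuses the cached data
filtered.agg(F.avg("amount")).show()      # still no recomputation

filtered.unpersist()                      # free the memory when done
```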