Optimizing Pytest and Pyspark/Spark

Our current test suite (302 tests) runs in ~55 minutes on average. This is a combination of unit and end-to-end (E2E) tests. My dev machine is a Dell Precision 5550 with a Core i9-10885H running at 2.4GHz, 64GB of memory, and a fairly fast SSD. We are using pytest as the test framework.

time pytest .
real    53m40.947s
user    3m3.985s
sys     3m43.023s

As an interesting note, here is a run from WSL 2 with Ubuntu 20.04:

real    32m12.278s
user    11m27.826s
sys     35m6.678s

Optimizing Spark

The SparkSession configuration differs between unit and E2E tests; the E2E tests get a larger memory allotment for the driver, among other relevant changes. Some googling led me to a few configurations that could help in-memory local runs of Spark:

self._session = SparkSession.builder.appName(name) \
    .master("local[*]") \
    .config("spark.sql.execution.arrow.enabled", "false") \
    .config("spark.cores.max", "8") \
    .config("spark.executor.heartbeatInterval", "3600s") \
    .config("spark.storage.blockManagerSlaveTimeoutMs", "4200s") \
    .config("spark.network.timeout", "4200s") \
    .config("spark.driver.memory", "6g") \
    .config("spark.executor.memory", "6g") \
    .config("spark.memory.offHeap.enabled", "false") \
    .config("spark.memory.offHeap.size", "2g") \
    .config("spark.ui.showConsoleProgress", "false") \
    .config("spark.sql.legacy.timeParserPolicy", "LEGACY") \
    .getOrCreate()

This was updated with the following changes:

Run Spark locally with as many worker threads as there are logical cores on the machine.

.master("local[*]") \
.config('spark.sql.shuffle.partitions', 1) \
.config('spark.default.parallelism', 1) \
.config('spark.rdd.compress', False) \
.config('spark.shuffle.compress', False) \
.config('spark.shuffle.spill.compress', False) \
.config('spark.dynamicAllocation.enabled', False) \
.config('spark.serializer', 'org.apache.spark.serializer.KryoSerializer') \
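
Putting this together, the tuned builder for local test runs ends up looking roughly like the sketch below (the helper name build_test_session is mine; the values mirror the snippets above):

from pyspark.sql import SparkSession

def build_test_session(name: str) -> SparkSession:
    # Tuned for small in-memory test data rather than cluster workloads:
    # one shuffle partition, no compression, no dynamic allocation.
    return SparkSession.builder.appName(name) \
        .master("local[*]") \
        .config("spark.sql.shuffle.partitions", 1) \
        .config("spark.default.parallelism", 1) \
        .config("spark.rdd.compress", False) \
        .config("spark.shuffle.compress", False) \
        .config("spark.shuffle.spill.compress", False) \
        .config("spark.dynamicAllocation.enabled", False) \
        .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
        .config("spark.ui.showConsoleProgress", "false") \
        .config("spark.sql.legacy.timeParserPolicy", "LEGACY") \
        .getOrCreate()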

These two configurations made the most difference in test execution time:

  • spark.sql.shuffle.partitions - Should be set to a low number for local test runs; the default is 200, which is far too many partitions for the tiny DataFrames used in tests (see the sketch after this list).
  • spark.default.parallelism - Should likewise be set to a low number; in local mode it defaults to the number of cores on the machine.
  • spark.serializer - The Kryo serializer is supposed to be faster, but I wasn't able to gauge its impact here.
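
To see why the shuffle partition setting matters for tiny test DataFrames, here is a quick sketch (not from the test suite) that checks the partition count after a shuffle:

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]") \
    .config("spark.sql.shuffle.partitions", 1) \
    .getOrCreate()

df = spark.createDataFrame([("a", 1), ("b", 2), ("a", 3)], ["key", "value"])
grouped = df.groupBy("key").count()

# With the default of 200, this shuffle would normally produce 200 mostly
# empty partitions, each with its own task and scheduling overhead; with
# the override it is 1.
print(grouped.rdd.getNumPartitions())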

These changes brought the test execution time down to:

real    7m53.697s
user    8m37.724s
sys     27m16.182s

Execution time decreased by roughly 77%!

I also ran various unit tests individually to confirm the config tweaks. I haven't found any other configurations or tweaks that do better than this.

If we were using pytest-spark, the partition and parallelism options would default to 1.
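
We are not using pytest-spark, but a session-scoped fixture in conftest.py gets a similar result by building the SparkSession once per test session. A minimal sketch, assuming the tuned configuration from above (the fixture name spark is my choice):

import pytest
from pyspark.sql import SparkSession

@pytest.fixture(scope="session")
def spark():
    # Build the tuned local session once, share it across all tests in the
    # session, and stop it when the run finishes.
    session = SparkSession.builder.appName("tests") \
        .master("local[*]") \
        .config("spark.sql.shuffle.partitions", 1) \
        .config("spark.default.parallelism", 1) \
        .getOrCreate()
    yield session
    session.stop()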

Parallelizing Tests

In order to run tests in parallel, pytest requires an additional package to be installed.

pip install pytest-xdist

Alternatively, the pytest-parallel package could be used, depending on what type of tests you have. It did not work for our tests.

Running the following actually yielded worse times.

pytest -n 8 .

This is because of how pytest-xdist decides how to distribute the tests. Running with the --dist argument to specify a strategy for splitting up the tests yielded some additional improvements (averaged from 3 runs):

time pytest -n 8 --dist=loadfile .

real    6m22.382s
user    11m22.218s
sys     30m0.397s

time pytest -n 8 --dist=loadscope .

real    5m44.324s
user    9m47.262s
sys     25m0.177s
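
--dist=loadscope sends all tests from the same module (or class) to the same worker, so tests that share an expensive Spark fixture stay together instead of forcing every worker to build its own copy for a handful of tests. A hedged sketch of how that grouping looks, assuming the session-scoped spark fixture above (class and test names are made up):

from pyspark.sql import functions as F

class TestAggregations:
    # With --dist=loadscope, every test in this class runs on the same
    # xdist worker, so that worker's SparkSession fixture is reused for
    # the whole group.
    def test_count(self, spark):
        df = spark.createDataFrame([(1,), (2,)], ["n"])
        assert df.count() == 2

    def test_sum(self, spark):
        df = spark.createDataFrame([(1,), (2,)], ["n"])
        assert df.agg(F.sum("n")).first()[0] == 3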

Next Steps

  • Categorize/mark the tests so that only the relevant category runs, instead of everything on every run (see the sketch below).
  • Refactor tests that use expensive fixtures so those fixtures are not built more than once per test session.
  • Move the E2E tests to run on Spark clusters in the cloud, which may yield better performance for tests that run against large DataFrames.
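
For the first point, pytest markers are the natural mechanism; a minimal sketch (the e2e marker name is an assumption and would need to be registered in pytest.ini):

import pytest

# Quick local runs can then skip the expensive tests with:
#   pytest -m "not e2e"
@pytest.mark.e2e
def test_end_to_end_pipeline(spark):
    df = spark.range(10)
    assert df.count() == 10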
