Our current tests (302 of them) take about 55 minutes on average to run. This is a combination of unit and end-to-end tests. My dev machine is a Dell Precision 5550 with a Core i9-10885H running at 2.4GHz, 64GB of memory, and a fairly fast SSD. We are using pytest as the test framework.
time pytest .
real 53m40.947s
user 3m3.985s
sys 3m43.023s
As an interesting note, here is the same run from WSL 2 with Ubuntu 20.04:
real 32m12.278s
user 11m27.826s
sys 35m6.678s
Optimizing Spark
The SparkSession configuration differs between the unit and E2E tests; the E2E tests get a larger memory allotment for the driver, among other changes. Some googling led me to some configurations that could help the in-memory (local) runs of Spark:
self._session = SparkSession.builder.appName(name) \
.master("local[*]") \
.config("spark.sql.execution.arrow.enabled", "false") \
.config("spark.cores.max", "8") \
.config("spark.executor.heartbeatInterval", "3600s") \
.config("spark.storage.blockManagerSlaveTimeoutMs", "4200s") \
.config("spark.network.timeout", "4200s") \
.config("spark.driver.memory", "6g") \
.config("spark.executor.memory", "6g") \
.config("spark.memory.offHeap.enabled", "false") \
.config("spark.memory.offHeap.size", "2g") \
.config("spark.ui.showConsoleProgress", "false") \
.config("spark.sql.legacy.timeParserPolicy", "LEGACY") \
.getOrCreate()
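As a rough illustration of that unit-vs-E2E split (the helper name and memory values here are hypothetical, not our actual code), the builder can be parameterized so E2E tests request a larger driver allotment:

from pyspark.sql import SparkSession

def build_session(name, e2e=False):
    # Hypothetical helper: E2E tests get more driver/executor memory than unit tests.
    memory = "12g" if e2e else "4g"
    return SparkSession.builder.appName(name) \
        .master("local[*]") \
        .config("spark.driver.memory", memory) \
        .config("spark.executor.memory", memory) \
        .config("spark.ui.showConsoleProgress", "false") \
        .getOrCreate()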
This was updated with the following changes:
Run Spark locally with as many worker threads as logical cores on the machine.
.master("local[*]") \
.config('spark.sql.shuffle.partitions', 1) \
.config('spark.default.parallelism', 1) \
.config('spark.rdd.compress', False) \
.config('spark.shuffle.compress', False) \
.config('spark.shuffle.spill.compress', False) \
.config('spark.dynamicAllocation.enabled', False) \
.config('spark.serializer', 'org.apache.spark.serializer.KryoSerializer') \
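For context, here is a minimal sketch of how the tuned configuration could be wrapped in a session-scoped pytest fixture so the session is built once per test run (the fixture name and app name are illustrative, not our actual code):

import pytest
from pyspark.sql import SparkSession

@pytest.fixture(scope="session")
def spark():
    # One tuned local session shared across the whole test session
    # instead of one per test.
    session = SparkSession.builder.appName("unit-tests") \
        .master("local[*]") \
        .config("spark.sql.shuffle.partitions", 1) \
        .config("spark.default.parallelism", 1) \
        .config("spark.rdd.compress", False) \
        .config("spark.shuffle.compress", False) \
        .config("spark.shuffle.spill.compress", False) \
        .config("spark.dynamicAllocation.enabled", False) \
        .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
        .getOrCreate()
    yield session
    session.stop()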
The first two of these configurations made the most difference in test execution time (a small demonstration follows the list):
spark.sql.shuffle.partitions
- Should be set to a low number for locally run tests. The default is 200.
spark.default.parallelism
- Should also be set to a low number; in local mode it defaults to the number of cores.
spark.serializer
- The Kryo serializer is supposed to be faster, but I wasn't able to gauge the impact.
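To see the effect, here is a quick check (a hypothetical snippet assuming an existing SparkSession named spark and adaptive query execution disabled): with the default of 200 shuffle partitions, even a three-row groupBy is shuffled into 200 tiny tasks, which is pure overhead for test-sized data.

df = spark.createDataFrame([("a", 1), ("a", 2), ("b", 3)], ["k", "v"])
counts = df.groupBy("k").count()

# Prints the configured shuffle partition count (200 by default, 1 with the tweak above)
print(spark.conf.get("spark.sql.shuffle.partitions"))
# The aggregation result ends up with that many partitions
print(counts.rdd.getNumPartitions())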
These changes brought the test execution time down to:
real 7m53.697s
user 8m37.724s
sys 27m16.182s
Execution time decreased by roughly 77%!
I also ran various unit tests individually with these config tweaks, and I haven't found any other configurations or tweaks that do better than this.
If we were using pytest-spark, the partition and parallelism options would default to 1.
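As I understand it, pytest-spark also exposes a session-scoped spark_session fixture built with those test-friendly defaults, so a test would look roughly like this (a sketch assuming pytest-spark is installed and configured):

def test_row_count(spark_session):
    # spark_session is provided by pytest-spark with test-friendly defaults,
    # including shuffle partitions and parallelism set to 1.
    df = spark_session.createDataFrame([(1,), (2,)], ["n"])
    assert df.count() == 2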
Parallelizing Tests
To run tests in parallel, pytest requires an additional package:
pip install pytest-xdist
Alternatively, the pytest-parallel package could be used, depending on what type of tests you have; it did not work for our tests.
Running the following actually yielded worse times:
pytest -n 8 .
This is because of how pytest-xdist decides to distribute the tests across workers. Running with the --dist argument to specify a strategy for splitting up the tests (loadfile keeps all tests from one file on the same worker; loadscope groups them by module or class) yielded some additional improvement (averaged from 3 runs):
time pytest -n 8 --dist=loadfile .
real 6m22.382s
user 11m22.218s
sys 30m0.397s
time pytest -n 8 --dist=loadscope .
real 5m44.324s
user 9m47.262s
sys 25m0.177s
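The improvement makes sense when tests that share an expensive session-scoped Spark fixture land on the same worker. A sketch of how that might look (class, fixture, and test names are illustrative): grouping Spark-heavy tests into one class means --dist=loadscope keeps them on a single xdist worker, so the Spark session there is built only once.

class TestAggregations:
    # With --dist=loadscope every test in this class runs on the same xdist
    # worker, so a session-scoped "spark" fixture is created once per worker.
    def test_group_count(self, spark):
        df = spark.createDataFrame([("a", 1), ("a", 2)], ["k", "v"])
        assert df.groupBy("k").count().collect()[0]["count"] == 2

    def test_group_sum(self, spark):
        df = spark.createDataFrame([("a", 1), ("a", 2)], ["k", "v"])
        assert df.groupBy("k").sum("v").collect()[0]["sum(v)"] == 3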
Next Steps
- Categorize/mark the tests so that only the relevant category runs for a given change, improving execution time by not running tests that aren't relevant (a sketch follows this list).
- Refactor tests that use expensive fixtures so the fixtures are not built more than once in a test session.
- Abstract the E2E tests to run on Spark clusters in the cloud, which may yield better performance for tests that run against large DataFrames.
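For the categorization step, a minimal sketch using pytest markers (the marker name and test are hypothetical, and the spark fixture is the one sketched earlier):

import pytest

# Registered in pytest.ini, e.g.:
#   [pytest]
#   markers =
#       e2e: end-to-end tests that run against large DataFrames
@pytest.mark.e2e
def test_full_pipeline(spark):
    ...

# Day-to-day runs can then skip the slow category entirely:
#   pytest -m "not e2e" -n 8 --dist=loadscope .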