Problem with PySpark and Delta Lake Table Unit Tests

The integration of Spark and Delta Lake tables is seamless for the most part. However, I ran into issues with unit tests that create and update Delta tables when they run alongside all of the other existing unit tests:

    @classmethod
    @since(0.4)
    def isDeltaTable(cls, sparkSession, identifier):
        """
        Check if the provided `identifier` string, in this case a file path,
        is the root of a Delta table using the given SparkSession.

        :param sparkSession: SparkSession to use to perform the check
        :param path: location of the table
        :return: If the table is a delta table or not
        :rtype: bool

        Example::

            DeltaTable.isDeltaTable(spark, "/path/to/table")
        """
        assert sparkSession is not None
>       return sparkSession._sc._jvm.io.delta.tables.DeltaTable.isDeltaTable(
            sparkSession._jsparkSession, identifier)
E       TypeError: 'JavaPackage' object is not callable

../../../anaconda3/envs/rfa/lib/python3.8/site-packages/delta/tables.py:433: TypeError

The call to isDeltaTable is blowing up.

The Spark session is global to the entire process, which in this case is started by pytest. Below is the relevant Delta Lake configuration on the session builder:

.config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
.config("spark.databricks.delta.schema.autoMerge.enabled", "true") \
.config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")

Side note: adding the following configuration, or replacing spark.sql.catalog.spark_catalog with it, yields varying results.

.config("spark.sql.catalog.local", "org.apache.spark.sql.delta.catalog.DeltaCatalog")

Workaround

The current workaround is to run the Delta Lake-specific unit tests in a separate pytest invocation:

pytest . --ignore=path\to\test\delta_lake_tests.py
pytest path\to\test\delta_lake_tests.py

Note that categorizing the Delta Lake tests with a custom pytest marker and selecting them by marker (-m) also fails, even though only the selected tests are run. Collecting the other tests seems to "pollute" the Spark session.

@pytest.mark.delta_table
def test__write(self):
    ...

pytest . -m delta_table
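
For completeness, a custom marker like this would normally also be registered so pytest does not warn about unknown markers; one way to do that, sketched below, is in conftest.py (the marker description text is illustrative):

# conftest.py: register the custom marker used above
def pytest_configure(config):
    config.addinivalue_line(
        "markers", "delta_table: tests that create or update Delta Lake tables"
    )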
