Updating to PySpark 3.3.1 and Python 3.10
ImportError: cannot import name PickleSerializer
Recently revisited updating the M1 MacBook Air to the latest version of PySpark and Python 3.10. Ran into some issues with Delta Table references and tests, which were resolved by updating the delta-spark package to the latest version:
pip install delta-spark==2.2.0
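For reference, a minimal sketch of how a Delta-enabled session can be built with delta-spark's configure_spark_with_delta_pip helper (the app name and master are placeholders, not from my actual setup):

from pyspark.sql import SparkSession
from delta import configure_spark_with_delta_pip

builder = (
    SparkSession.builder.appName("delta-test")  # placeholder app name
    .master("local[*]")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config(
        "spark.sql.catalog.spark_catalog",
        "org.apache.spark.sql.delta.catalog.DeltaCatalog",
    )
)

# configure_spark_with_delta_pip pulls in the matching delta-core jars.
spark = configure_spark_with_delta_pip(builder).getOrCreate()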
The subsequent tests were failing with:
ImportError: cannot import name PickleSerializer
The error was coming from this import:
from pyspark import PickleSerializer, MarshalSerializer
I was able to look at the definition of PickleSerializer in serializers.py:
class PickleSerializer(FramedSerializer):
    """
    Serializes objects using Python's pickle serializer:

        http://docs.python.org/2/library/pickle.html

    This serializer supports nearly any Python object, but may
    not be as fast as more specialized serializers.
    """

    def dumps(self, obj):
        return pickle.dumps(obj, pickle_protocol)

    def loads(self, obj, encoding="bytes"):
        return pickle.loads(obj, encoding=encoding)
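As a quick sanity check (my own snippet, not from the Spark source), the serializer round-trips ordinary Python objects:

from pyspark.serializers import PickleSerializer

ser = PickleSerializer()
payload = {"table": "events", "version": 2}  # arbitrary example object
# dumps frames the object to bytes; loads restores it.
assert ser.loads(ser.dumps(payload)) == payload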
So PickleSerializer is still defined. Looking at line 353 of serializers.py:
if sys.version_info < (3, 8):
    CPickleSerializer = PickleSerializer
else:
    CPickleSerializer = CloudPickleSerializer
Looks like Python versions below 3.8 alias CPickleSerializer to PickleSerializer, while 3.8 and above alias it to CloudPickleSerializer.
Updated the one test to use the new alias and it now succeeds.
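A sketch of the fix, assuming the test only needed the serializer names (the exact test code isn't shown here):

# Before (fails on PySpark 3.3.1):
# from pyspark import PickleSerializer, MarshalSerializer

# After: import the version-agnostic alias from pyspark.serializers.
from pyspark.serializers import CPickleSerializer, MarshalSerializer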
Looks like the default serializer is now CloudPickleSerializer. The module docstring in serializers.py says as much:
"""
PySpark supports custom serializers for transferring data; this can improve
performance.
By default, PySpark uses :class:`CloudPickleSerializer` to serialize objects using Python's
`cPickle` serializer, which can serialize nearly any Python object.
Other serializers, like :class:`MarshalSerializer`, support fewer datatypes but can be
faster.
"""
TODO: Look at performance tuning with different serializers.