PySpark and Implicit String/Boolean Conversion

There was some uncertainty about whether the following line actually returns a dataframe filtered on the correct value. All testing below was conducted with PySpark 3.1.1:

df = some_dataframe.where(f.col('SomeColumn') == 'true')

If SomeColumn is a string datatype column, I would expect it to return a dataframe with the rows matching 'true'. This is indeed the case.

If SomeColumn is a boolean datatype column, I would have expected it to possibly follow the Python interpretation, where any non-empty string is truthy and everything else is falsy. This is not the case.
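For reference, Python's own truthiness rules treat any non-empty string, including 'false', as truthy, which is exactly why the Spark behavior is worth verifying:

```python
# Python truthiness: any non-empty string is truthy,
# regardless of its contents.
print(bool('true'))   # True
print(bool('false'))  # True -- still a non-empty string
print(bool(''))       # False -- only the empty string is falsy
```

If Spark followed these rules, both `== 'true'` and `== 'false'` would match every row of a boolean column; the tests below show it does something else.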

Boolean DataType Column Filter with String

from pyspark.sql import DataFrame, Row, SparkSession
import pyspark.sql.functions as f

ss = SparkSession.builder.getOrCreate()

some_dataframe: DataFrame = ss.createDataFrame([
    Row('1', True),
    Row('2', False),
    ], ['ID', 'SomeBooleanColumn']
)

print(some_dataframe.dtypes)

a = some_dataframe.where(f.col('SomeBooleanColumn') == 'true')
a.show()

b = some_dataframe.where(f.col('SomeBooleanColumn') == 'false')
b.show()

Schema

[('ID', 'string'), ('SomeBooleanColumn', 'boolean')]

Output

+---+-----------------+
| ID|SomeBooleanColumn|
+---+-----------------+
|  1|             true|
+---+-----------------+

+---+-----------------+
| ID|SomeBooleanColumn|
+---+-----------------+
|  2|            false|
+---+-----------------+

Using a string to filter a boolean datatype column works.
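It works because Spark resolves the boolean-vs-string comparison by casting the string side to a boolean, not by applying Python truthiness. A rough pure-Python sketch of Spark's string-to-boolean cast semantics (the accepted token list is my assumption based on Spark's documented cast behavior, not something tested in this post):

```python
def spark_string_to_boolean(s):
    """Approximate sketch of Spark's StringType -> BooleanType cast:
    a small set of tokens map to True/False case-insensitively;
    anything else casts to NULL (None here)."""
    token = s.strip().lower()
    if token in ('t', 'true', 'y', 'yes', '1'):
        return True
    if token in ('f', 'false', 'n', 'no', '0'):
        return False
    return None  # NULL in Spark, so the comparison is NULL and the row is dropped

print(spark_string_to_boolean('true'))    # True
print(spark_string_to_boolean('FALSE'))   # False
print(spark_string_to_boolean('banana'))  # None
```

Under this model, `f.col('SomeBooleanColumn') == 'true'` behaves like comparing the column against a boolean literal True, which matches the output above.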

String DataType Column Filter with Boolean

some_dataframe: DataFrame = ss.createDataFrame([
    Row('1', 'true'),
    Row('2', 'false'),
    ], ['ID', 'SomeStringColumn']
)

print(some_dataframe.dtypes)

a = some_dataframe.where(f.col('SomeStringColumn') == True)
a.show()

b = some_dataframe.where(f.col('SomeStringColumn') == False)
b.show()

Schema

[('ID', 'string'), ('SomeStringColumn', 'string')]

Output

+---+----------------+
| ID|SomeStringColumn|
+---+----------------+
|  1|            true|
+---+----------------+

+---+----------------+
| ID|SomeStringColumn|
+---+----------------+
|  2|           false|
+---+----------------+

Using a boolean to filter a string datatype column also works.

Conclusion

Quick googling didn't turn up much on this, and it is easy enough to verify. PySpark is flexible here: boolean columns can be filtered with strings, and string columns can be filtered with booleans.

True == 'True' == 'true'

False == 'False' == 'false'