It was unclear whether the following line actually returns a dataframe containing the correct rows. All tests below were run with PySpark 3.1.1:
df = some_dataframe.where(f.col('SomeColumn') == 'true')
If SomeColumn were of string datatype, I would expect it to return a dataframe containing the rows that match 'true'. This is indeed the case.
If SomeColumn were of boolean datatype, I would expect it to possibly follow the Python interpretation, in which any non-empty string is True and anything else is False. This is not the case.
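To make the "Python interpretation" concrete, here is a minimal plain-Python sketch (no Spark involved) of what truthiness would imply. The rows list is a hypothetical stand-in for the dataframe used in the tests below.

```python
# Python truthiness: every non-empty string is truthy, whatever it spells.
candidates = ["true", "false", "", "0"]
truthiness = {text: bool(text) for text in candidates}
print(truthiness)
# {'true': True, 'false': True, '': False, '0': True}

# Hypothetical stand-in for the dataframe rows: (ID, boolean flag).
rows = [("1", True), ("2", False)]

# If a filter applied Python truthiness, the literal 'false' would behave
# like True and keep the wrong row.
python_style = [row_id for row_id, flag in rows if flag == bool("false")]
print(python_style)  # ['1']
```

If PySpark followed this rule, filtering with 'false' would return row 1 rather than row 2, which is not what the tests below show.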
Boolean DataType Column Filter with String
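from pyspark.sql import DataFrame, Row, SparkSession
import pyspark.sql.functions as f

# Assumption: ss is the active SparkSession; the original snippet does not define it.
ss = SparkSession.builder.getOrCreate()

some_dataframe: DataFrame = ss.createDataFrame([
    Row('1', True),
    Row('2', False),
], ['ID', 'SomeBooleanColumn'])
print(some_dataframe.dtypes)
# Compare the boolean column against string literals.
a = some_dataframe.where(f.col('SomeBooleanColumn') == 'true')
a.show()
b = some_dataframe.where(f.col('SomeBooleanColumn') == 'false')
b.show()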
Schema
[('ID', 'string'), ('SomeBooleanColumn', 'boolean')]
Output
+---+-----------------+
| ID|SomeBooleanColumn|
+---+-----------------+
|  1|             true|
+---+-----------------+

+---+-----------------+
| ID|SomeBooleanColumn|
+---+-----------------+
|  2|            false|
+---+-----------------+
Filtering a boolean datatype column with a string works.
String DataType Column Filter with Boolean
some_dataframe: DataFrame = ss.createDataFrame([
Row('1', 'true'),
Row('2', 'false'),
], ['ID', 'SomeStringColumn']
)
print(some_dataframe.dtypes)
a = some_dataframe.where(f.col('SomeStringColumn') == True)
a.show()
b = some_dataframe.where(f.col('SomeStringColumn') == False)
b.show()
Schema
[('ID', 'string'), ('SomeStringColumn', 'string')]
Output
+---+----------------+
| ID|SomeStringColumn|
+---+----------------+
|  1|            true|
+---+----------------+

+---+----------------+
| ID|SomeStringColumn|
+---+----------------+
|  2|           false|
+---+----------------+
Filtering a string datatype column with a boolean works.
Conclusion
A quick search didn't turn up much on this topic, but it is easy enough to verify. PySpark seems flexible here: boolean datatype columns can be filtered with strings, and string datatype columns can be filtered with booleans.
True == 'True' == 'true'
False == 'False' == 'false'
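The equivalences above can be mimicked in plain Python with a toy model. This is only a sketch of the observed behaviour, not Spark's actual coercion logic: the helper names to_bool and loose_equals are hypothetical, and only the spellings listed above are handled.

```python
from typing import Optional, Union

Value = Union[bool, str]


def to_bool(value: Value) -> Optional[bool]:
    """Toy string-to-boolean conversion covering only 'true'/'false' in any case."""
    if isinstance(value, bool):
        return value
    lowered = value.lower()
    if lowered == "true":
        return True
    if lowered == "false":
        return False
    return None  # any other spelling is treated as unknown in this sketch


def loose_equals(left: Value, right: Value) -> Optional[bool]:
    """Compare two values after converting both to bool; None if either is unknown."""
    left_bool, right_bool = to_bool(left), to_bool(right)
    if left_bool is None or right_bool is None:
        return None
    return left_bool == right_bool


# Hypothetical stand-in for the boolean-column dataframe: (ID, flag).
rows = [("1", True), ("2", False)]
matches_true = [row_id for row_id, flag in rows if loose_equals(flag, "true")]
matches_false = [row_id for row_id, flag in rows if loose_equals(flag, "false")]
print(matches_true)   # ['1']
print(matches_false)  # ['2']
```

Unlike Python truthiness, this model compares what the string spells rather than whether it is empty, which is consistent with the filter results shown earlier.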