Reputation: 21233
I have read in a parquet file, and I would like to filter rows using a prepared dict. There are two columns in the dataframe, called col1 and col2, which are of type string. My dict contains a set of strings, and I want the rows where the concatenation of the strings in columns col1 and col2 is in the dict. I tried
df.filter((df['col1']+df['col2']) in my_dict)
but it seems that df['col1']+df['col2'] is not a string, even though that is the type of both columns.
I also tried
df.filter(lambda x: (x['col1']+x['col2']) in my_dict)
What's the right way to do this?
Upvotes: 0
Views: 2299
Reputation: 60317
So, there are 2 components in your issue: concatenating the two string columns into one, and then checking whether that concatenation is in your collection.
Regarding the first part - here is an example of string column concatenation using a toy dataframe:
spark.version
# u'2.1.1'
from pyspark.sql.functions import concat, col, lit
df = spark.createDataFrame([("foo", 1), ("bar", 2)], ("k", "v"))
df.show()
# +---+---+
# | k| v|
# +---+---+
# |foo| 1|
# |bar| 2|
# +---+---+
df2 = df.select(concat(col("k"), lit(" "), col("v")).alias('joined_colname'))
df2.show()
# +--------------+
# |joined_colname|
# +--------------+
# | foo 1|
# | bar 2|
# +--------------+
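One caveat: Spark's concat returns null if any of its inputs is null, so such rows would never survive a later filter. If that matters for your data, concat_ws (which skips null inputs) is a possible alternative - a minimal sketch on the same toy dataframe:
from pyspark.sql.functions import concat_ws, col
# concat_ws(sep, *cols) joins with a separator and ignores null inputs
df.select(concat_ws(' ', col('k'), col('v')).alias('joined_colname')).show()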
Regarding the second part, you need the .isin method - not sure it will work with dictionaries, but it definitely works with lists (['foo 1', 'foo 2']) or sets ({'foo 1', 'foo 2'}):
df2.filter(col('joined_colname').isin({'foo 1', 'foo 2'})).show() # works with lists, too
# +--------------+
# |joined_colname|
# +--------------+
# | foo 1|
# +--------------+
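If your collection really is a Python dict rather than a set, one option - assuming you want to match against its keys - is to materialize them as a list before passing it to .isin:
my_dict = {'foo 1': 1, 'bar 2': 2}  # hypothetical dict standing in for the one in the question
df2.filter(col('joined_colname').isin(list(my_dict))).show()  # list(my_dict) yields the keys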
Hope this is helpful enough...
EDIT (after comment): to keep the joined column together with the columns of your initial df:
df3 = df.withColumn('joined_colname', concat(col("k"), lit(" "), col("v")))
df3.show()
# +---+---+--------------+
# | k| v|joined_colname|
# +---+---+--------------+
# |foo| 1| foo 1|
# |bar| 2| bar 2|
# +---+---+--------------+
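Putting the two parts together for your case - a sketch assuming string columns named col1 and col2 and a collection my_dict, as in your question (the helper column name 'joined' is arbitrary, and no separator is used since you want the plain concatenation):
from pyspark.sql.functions import concat, col
result = (df.withColumn('joined', concat(col('col1'), col('col2')))
            .filter(col('joined').isin(list(my_dict)))
            .drop('joined'))  # drop the helper column once filtering is done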
Upvotes: 3