I have these dataframes and lists:
list1 = [(1, 'abc', 234), (2, 'eds', 122), (3, 'rte', 565), (5, 'asd', 999), (10, 'weq', 90),
         (11, 'dcd', 98), (11, 'dcd', 98)]
list2 = [(2, None, 455), (3, 'xyz', 565), (4, 'wer', 234), (8, 'ioo', 878), (10, 'weq', 90),
         (11, 'dcd', 95), (11, 'sd', 91), (11, 'dd', 9812)]
df1 = spark.createDataFrame(list1, ['id', 'value','number'])
df2 = spark.createDataFrame(list2, ['id', 'value','number'])
cols_eq = [[10, 11, 11], ['weq', 'dcd', 'dcd']]
My problem is with the next line:
from pyspark.sql.functions import col

df_dif = df2.unionAll(df1).filter(~(col('id').isin(cols_eq[0]) & col('value').isin(cols_eq[1])))
Specifically, inside the filter, as you can see, I am filtering on the columns I am interested in by typing them one by one, but I need a way to do it without typing each one. I tried to write a for loop over the columns I'm interested in (id and value in this case), but I couldn't come up with a way to incorporate isin() correctly. So if you can think of a loop, or maybe a PySpark function, that will do what I need, I would appreciate it.
You can do so by building up your filter in a loop. You'll need a list of the relevant columns you want to apply each item of cols_eq to. If they are the first n columns of your dataframe, you could do something like cols = df1.columns[:n].
import pyspark.sql.functions as f

cols = ['id', 'value']
cols_eq = [[10, 11, 11], ['weq', 'dcd', 'dcd']]

# Start from a neutral "always true" condition, then AND in one
# isin() check per (column, values) pair.
fil = f.lit(True)
for column, vals in zip(cols, cols_eq):
    fil = fil & f.col(column).isin(vals)

df2.unionAll(df1).filter(~fil).show()
+---+-----+------+
| id|value|number|
+---+-----+------+
| 2| null| 455|
| 3| xyz| 565|
| 4| wer| 234|
| 8| ioo| 878|
| 11| sd| 91|
| 11| dd| 9812|
| 1| abc| 234|
| 2| eds| 122|
| 3| rte| 565|
| 5| asd| 999|
+---+-----+------+
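If you prefer not to mutate fil inside a loop, here is a minimal sketch of the same idea using functools.reduce; it folds one isin() condition per column into a single predicate and builds exactly the same filter, so it is purely a stylistic alternative:

from functools import reduce

import pyspark.sql.functions as f

cols = ['id', 'value']
cols_eq = [[10, 11, 11], ['weq', 'dcd', 'dcd']]

# Fold one isin() check per (column, values) pair into an AND-ed
# predicate, starting from a neutral "always true" column.
fil = reduce(
    lambda acc, cv: acc & f.col(cv[0]).isin(cv[1]),
    zip(cols, cols_eq),
    f.lit(True),
)
df2.unionAll(df1).filter(~fil).show()

One caveat with either version: chaining per-column isin() checks matches any combination of the listed values, not row-wise pairs. A row with id 10 and value 'dcd' would also be excluded, even though that exact pair never appears together in cols_eq.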