BryC

Reputation: 113

use a loop inside a filter function

I have these dataframes and lists:

list1 = [(1, 'abc', 234), (2, 'eds', 122), (3, 'rte', 565), (5, 'asd', 999),
         (10, 'weq', 90), (11, 'dcd', 98), (11, 'dcd', 98)]
list2 = [(2, None, 455), (3, 'xyz', 565), (4, 'wer', 234), (8, 'ioo', 878),
         (10, 'weq', 90), (11, 'dcd', 95), (11, 'sd', 91), (11, 'dd', 9812)]
df1 = spark.createDataFrame(list1, ['id', 'value', 'number'])
df2 = spark.createDataFrame(list2, ['id', 'value', 'number'])

cols_eq = [[10, 11, 11], ['weq', 'dcd', 'dcd']]

My problem is with the following line:

from pyspark.sql.functions import col

df_dif = df2.unionAll(df1).filter(~(col('id').isin(cols_eq[0]) & col('value').isin(cols_eq[1])))

Specifically, inside the filter I am naming the columns I am interested in one by one, but I need a way to build the condition without typing each one. I tried writing a for loop over the columns of interest (id and value in this case), but I couldn't come up with a way to incorporate isin() correctly. So, if you can think of a loop-based approach, or a PySpark function that does what I need, I would appreciate it.

Upvotes: 0

Views: 329

Answers (1)

ScootCork

Reputation: 3686

You can do so by building up your filter in a loop. You'll need a list of the relevant columns to apply each item of cols_eq to. If they are the first n columns of your dataframe, you can get them with cols = df1.columns[:n].
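For the data in the question n would be 2, e.g. (assuming the df1 defined above):

cols = df1.columns[:2]  # ['id', 'value']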

import pyspark.sql.functions as f

cols = ['id', 'value']
cols_eq = [[10, 11, 11], ['weq', 'dcd', 'dcd']]

# Start from a literal True and AND in one isin condition per column
fil = f.lit(True)
for column, vals in zip(cols, cols_eq):
    fil = fil & f.col(column).isin(vals)

# Negate the combined condition to keep only the rows that do NOT match
df2.unionAll(df1).filter(~fil).show()

+---+-----+------+
| id|value|number|
+---+-----+------+
|  2| null|   455|
|  3|  xyz|   565|
|  4|  wer|   234|
|  8|  ioo|   878|
| 11|   sd|    91|
| 11|   dd|  9812|
|  1|  abc|   234|
|  2|  eds|   122|
|  3|  rte|   565|
|  5|  asd|   999|
+---+-----+------+
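A functionally equivalent variant, if you prefer folding the conditions over mutating fil in a loop, is functools.reduce. This is just a sketch using the same cols and cols_eq as above:

import pyspark.sql.functions as f
from functools import reduce

cols = ['id', 'value']
cols_eq = [[10, 11, 11], ['weq', 'dcd', 'dcd']]

# Fold the per-column isin conditions into a single AND-ed expression,
# starting from a literal True
fil = reduce(
    lambda acc, pair: acc & f.col(pair[0]).isin(pair[1]),
    zip(cols, cols_eq),
    f.lit(True),
)

df2.unionAll(df1).filter(~fil).show()

Also note that unionAll is just an alias for union since Spark 2.0; both keep duplicate rows, so add distinct() afterwards if you need set semantics.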

Upvotes: 1
