b-ryce

Reputation: 5828

Pyspark filter where value is in another dataframe

I have two data frames. I need to filter one to only show values that are contained in the other.

table_a:

+---+----+
|AID| foo|
+---+----+
| 1 | bar|
| 2 | bar|
| 3 | bar|
| 4 | bar|
+---+----+

table_b:

+---+
|BID|
+---+
| 1 |
| 2 |
+---+

In the end I want to filter table_a down to only the rows whose IDs are in table_b, like this:

+--+----+
|ID| foo|
+--+----+
| 1| bar|
| 2| bar|
+--+----+

Here is what I'm trying to do:

result_table = table_a.filter(table_b.BID.contains(table_a.AID))

But this doesn't seem to be working. It looks like I'm getting ALL values.

NOTE: I can't add any imports other than from pyspark.sql.functions import col

Upvotes: 2

Views: 3260

Answers (3)

Neshy

Reputation: 39

This should work too:

b_ids = [row.BID for row in table_b.select("BID").distinct().collect()]
table_a.where(col("AID").isin(b_ids))

Upvotes: 0

dsk

Reputation: 2003

If the second dataframe contains duplicates or multiple values and you only want the distinct ones, the approach below can be useful:

Create the dataframes

from pyspark.sql import functions as F

df = spark.createDataFrame([(1, "bar"), (2, "bar"), (3, "bar"), (4, "bar")], ["col1", "col2"])
df_lookup = spark.createDataFrame([(1, 1), (1, 2)], ["id", "val"])
df.show(truncate=True)
df_lookup.show()
+----+----+
|col1|col2|
+----+----+
|   1| bar|
|   2| bar|
|   3| bar|
|   4| bar|
+----+----+

+---+---+
| id|val|
+---+---+
|  1|  1|
|  1|  2|
+---+---+

Get all the unique values of the val column in the second dataframe and store them in a list variable

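# Collect the distinct lookup values into a Python list on the driver
df_lookup_var = df_lookup.agg(F.collect_set("val").alias("val")).collect()[0][0]
print(df_lookup_var)
# Flag rows whose col1 is in the lookup list, then keep only the flagged rows
df = df.withColumn("case_col", F.when(F.col("col1").isin(df_lookup_var), F.lit("1")).otherwise(F.lit("0")))
df = df.filter(F.col("case_col") == F.lit("1"))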
df.show()
+----+----+--------+
|col1|col2|case_col|
+----+----+--------+
|   1| bar|       1|
|   2| bar|       1|
+----+----+--------+

Upvotes: 0

Cena

Reputation: 3419

You can join the two tables and specify how = 'left_semi'.
A left semi-join returns the rows from the left side of the relation that have a match on the right.

# left_semi keeps only table_a's columns, so there is no BID column to drop
result_table = table_a.join(table_b, table_a.AID == table_b.BID, how="left_semi")

result_table.show()
+---+---+
|AID|foo|
+---+---+
|  1|bar|
|  2|bar|
+---+---+
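
For intuition, here is a rough plain-Python model of what a semi-join does compared with an inner join. This is not Spark code: the helper names left_semi and inner are made up for illustration, and the duplicated row in table_b is an assumption added only to make the difference visible.

```python
# Plain-Python model of the two join types; rows are dicts, not Spark Rows.
table_a = [
    {"AID": 1, "foo": "bar"},
    {"AID": 2, "foo": "bar"},
    {"AID": 3, "foo": "bar"},
    {"AID": 4, "foo": "bar"},
]
# Hypothetical table_b with a duplicated key (not in the question's data).
table_b = [{"BID": 1}, {"BID": 1}, {"BID": 2}]


def left_semi(left, right, left_key, right_key):
    """Keep each left row at most once if its key appears on the right."""
    right_keys = {row[right_key] for row in right}
    return [row for row in left if row[left_key] in right_keys]


def inner(left, right, left_key, right_key):
    """Emit one merged row per matching (left, right) pair."""
    return [
        {**l_row, **r_row}
        for l_row in left
        for r_row in right
        if l_row[left_key] == r_row[right_key]
    ]


semi_rows = left_semi(table_a, table_b, "AID", "BID")
inner_rows = inner(table_a, table_b, "AID", "BID")

print(semi_rows)   # two rows, only the AID and foo keys
print(inner_rows)  # three rows, because AID 1 matches BID 1 twice
```

In this model the semi-join never duplicates rows from the left side and never adds columns from the right side, which is why it behaves like a filter rather than a regular join.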

Upvotes: 3
