Subset dataframe based on matching values in another dataframe Pyspark 1.6.1

Question

I have two dataframes. The first dataframe contains just one column, business_contact_nr, which is a set of client numbers.

| business_contact_nr |
| 34567               |
| 45678               |

The second dataframe contains multiple columns: bc holds the client numbers, and the other columns hold information about these clients.

| bc    | gender | savings | month  |
| 34567 | 1      | 100     | 200512 |
| 34567 | 1      | 200     | 200601 |
| 45678 | 0      | 500     | 200512 |
| 45678 | 0      | 500     | 200601 |
| 01234 | 1      | 60      | 200512 |
| 01234 | 1      | 150     | 200601 |

What I would like to do is subset the second dataframe so that it keeps only the rows whose client number also appears in the first dataframe.

All rows whose client number is not in the first dataframe should be dropped, in this case all rows where bc = 01234.

I am working with PySpark 1.6.1. Any idea how to do this?

Gaurav Bansal · Accepted Answer

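This can be solved with a join. Assume df1 is your first dataframe and df2 is your second. Rename df1.business_contact_nr to bc so that both dataframes share the key column, then join on it. The default join type is inner, so only the rows of df2 whose bc also appears in df1 are kept: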

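# Give the key column the same name in both dataframes
df1 = df1.withColumnRenamed('business_contact_nr', 'bc')
# Inner join on bc: rows of df2 without a match in df1 are dropped
df2subset = df2.join(df1, on='bc')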
