Reputation: 168
I have two dataframes. The first dataframe contains just one column, business_contact_nr, which is a set of client numbers.
| business_contact_nr |
| 34567               |
| 45678               |
The second dataframe contains multiple columns: bc holds the client numbers, and the other columns contain information about these clients.
| bc    | gender | savings | month  |
| 34567 | 1      | 100     | 200512 |
| 34567 | 1      | 200     | 200601 |
| 45678 | 0      | 500     | 200512 |
| 45678 | 0      | 500     | 200601 |
| 01234 | 1      | 60      | 200512 |
| 01234 | 1      | 150     | 200601 |
What I would like to do is subset the second dataframe, keeping only the rows whose client number also appears in the first dataframe.
So all rows with a client number that is not in the first dataframe should be dropped, in this case all rows where bc = 01234.
I am working with PySpark 1.6.1. Any idea how to do this?
Upvotes: 3
Views: 1466
Reputation: 5660
This can be solved with a join. Assume df1 is your first dataframe and df2 is your second dataframe. Then you can first rename df1.business_contact_nr and join:
df1 = df1.withColumnRenamed('business_contact_nr', 'bc')
df2subset = df2.join(df1, on='bc')
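One thing to watch with the inner join above: if df1 ever lists the same client number more than once, the matching rows of df2 come back repeated. A left semi join avoids that, since it only filters df2 and adds nothing from df1. Below is a minimal sketch of that variant, assuming the column names from the question; the helper name keep_known_clients is made up for illustration, and "leftsemi" is the join type string as listed in the PySpark 1.6 documentation for DataFrame.join.

```python
def keep_known_clients(df2, df1):
    """Return the rows of df2 whose bc appears in df1.business_contact_nr.

    A left semi join only filters df2: it adds no columns from df1 and
    does not repeat df2 rows when df1 lists a client number more than once.
    """
    # Join condition as a Column expression, so no rename of df1 is needed.
    condition = df2["bc"] == df1["business_contact_nr"]
    return df2.join(df1, condition, "leftsemi")


# Usage, assuming df1 and df2 are the dataframes from the question:
# df2subset = keep_known_clients(df2, df1)
```

With the sample data from the question, this should keep the four rows for 34567 and 45678 and drop the two rows for 01234, provided both key columns have the same type.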
Upvotes: 2