R0bert

Reputation: 557

Intersection of two data frames with different columns in PySpark

I am new to data science and I am working on a simple personal project using Google Colab. I read data from a something1.csv file and a something2.csv file.

df1 = spark.read.csv('something1.csv', header=True)
df2 = spark.read.csv('something2.csv', header=True)
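(Here spark is a SparkSession created earlier in the notebook; a minimal sketch of that setup, assuming the standard builder and a placeholder app name:)

from pyspark.sql import SparkSession

# minimal SparkSession setup for a Colab notebook (the app name is just a placeholder)
spark = SparkSession.builder.appName("colab-project").getOrCreate()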

The data in something1.csv looks like this:

#+----------+---------+----------+--------+-------+
#|   country| Latitude| Longitude|   col1 |  col2 |
#+----------+---------+----------+--------+-------+
#|   Andorra|   42.506|    1.5218|    2   |   1   |
#| Australia|   -31.81|    115.86|    1   |   6   |
#|   Austria|   41.597|    12.580|    4   |   9   |
#|   Belgium|   21.782|     1.286|    2   |   3   |
#|     India|   78.389|    12.972|    1   |   7   |
#|   Ireland|    9.281|     9.286|    9   |   8   |
#|       USA|   69.371|    21.819|    7   |   2   |
#+----------+---------+----------+--------+-------+

The data in something2.csv looks like this:

#+----------+---------+----------+--------+-------+
#|   country| Latitude| Longitude|   col1 |  col2 |
#+----------+---------+----------+--------+-------+
#| Australia|   -31.81|    115.86|    2   |   6   |
#|   Belgium|   21.782|     1.286|    1   |   6   |
#|     India|   78.389|    12.972|    3   |   5   |
#|       USA|   69.371|    21.819|    2   |   5   |
#+----------+---------+----------+--------+-------+

Now I want to intersect df2 with df1 on Latitude and Longitude, keeping the rows present in both DataFrames but taking col1 and col2 from df1. The result should look like this:

#+----------+---------+----------+--------+-------+
#|   country| Latitude| Longitude|   col1 |  col2 |
#+----------+---------+----------+--------+-------+
#| Australia|   -31.81|    115.86|    1   |   6   |
#|   Belgium|   21.782|     1.286|    2   |   3   |
#|     India|   78.389|    12.972|    1   |   7   |
#|       USA|   69.371|    21.819|    7   |   2   |
#+----------+---------+----------+--------+-------+

I tried the following code, but it didn't work.

new_df = df1.intersect(df2)  # using intersect in PySpark, which gave me an empty table

Then I also tried intersecting on just Latitude and Longitude:

new_df = df2.select('Latitude','Longitude').intersect(df1.select('Latitude','Longitude'))  # intersecting on selected columns only

I tried both of the above methods in PySpark, but neither worked.

Upvotes: 1

Views: 1725

Answers (1)

notNull

Reputation: 31490

Intersect only returns rows that are common to both DataFrames.

  • But in your case you need col1 and col2 from df1 and the other columns from df2: join the DataFrames (left or inner, as per your requirement) and select only col1 and col2 from df1 and the remaining columns from df2.

  • Or, as mentioned in the comments by Mohammad Murtaza Hashmi, use a left_semi join.

Example:

#using left semi join
df1.join(df2,['Latitude','Longitude'],'left_semi').show()
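The left_semi join keeps only df1's columns and returns the df1 rows whose Latitude/Longitude pair also appears in df2, which is exactly the expected table above.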

#using left join
df2.alias("t2").join(df1.alias("t1"),['Latitude','Longitude'],'left').select("t2.country","t2.Latitude","t2.Longitude","t1.col1","t1.col2").show()
#+---------+--------+---------+----+----+
#|  country|Latitude|Longitude|col1|col2|
#+---------+--------+---------+----+----+
#|Australia|  -31.81|   115.86|   1|   6|
#|  Belgium|  21.782|    1.286|   2|   3|
#|    India|  78.389|   12.972|   1|   7|
#|      USA|  69.371|   21.819|   7|   2|
#+---------+--------+---------+----+----+

Dynamic way:

#join columns
join_cols=[x for x in df1.columns if x.startswith("L")]

#selecting cols from t1
t1_cols=["t1."+x for x in df1.columns if x.startswith("col")]

#selecting cols from t2
t2_cols=["t2."+x for x in df2.columns if not x.startswith("col")]

df2.alias("t2").join(df1.alias("t1"),join_cols,'inner').select(*t2_cols + t1_cols).show()

#+---------+--------+---------+----+----+
#|  country|Latitude|Longitude|col1|col2|
#+---------+--------+---------+----+----+
#|Australia|  -31.81|   115.86|   1|   6|
#|  Belgium|  21.782|    1.286|   2|   3|
#|    India|  78.389|   12.972|   1|   7|
#|      USA|  69.371|   21.819|   7|   2|
#+---------+--------+---------+----+----+

Upvotes: 3
