data_addict

Reputation: 894

Spark/Pandas/SQL table Joins

I have the two dataframes below and want to populate the final dataframe by joining the two input dataframes.

df1 i.e table1

id    | code | name | location  | val | date
1000  | 1    | 'A'  | 'AABB'    | 1   | 2021-01-01
1000  | 2    | 'B'  | 'BBCC'    | 3   | 2021-01-01
1000  | 3    | 'C'  | 'CCDD'    | 4   | 2021-01-01
1000  | 4    | 'D'  | 'DDEE'    | 1   | 2021-01-01
2000  | 1    | 'E'  | 'EEFF'    | 5   | 2021-03-01
2000  | 2    | 'F'  | 'XXYY'    | 4   | 2021-03-01
2000  | 3    | 'G'  | 'YYZZ'    | 2   | 2021-03-01
2000  | 4    | 'H'  | 'ZZAA'    | 1   | 2021-03-01
2000  | 4    | 'I'  | 'IIII'    | 1   | 2021-03-01

df1.createOrReplaceTempView('df1')

df2 i.e table2

id   | city | dist | state | count | tot_sum | date
1000 | null | null | null  | null  | null    | 2021-01-01
2000 | null | null | null  | null  | null    | 2021-03-01

df2.createOrReplaceTempView('df2')

df3 i.e table3

id   | city   | dist   | state  | count | tot_sum | date
1000 | 'AABB' | 'BBCC' | 'CCDD' | 1     |  9      | 2021-01-01
2000 | 'EEFF' | 'XXYY' | 'YYZZ' | 2     |  13     | 2021-03-01

Logic:

when code =1 then consider location as city

when code =2 then consider location as dist

when code =3 then consider location as state

when code =4 then count the total number of records with that code for that id, i.e. id 1000 has only one record with code 4, while id 2000 has two records with code 4

tot_sum is the sum of all vals for that id, i.e. for id 1000 it is 1+3+4+1 = 9, and for id 2000 it is 5+4+2+1+1 = 13

I am trying something like the below, however it didn't work:


select d2.id as id,
       d2.date as date,
       CASE WHEN d1.code=1 then d1.location else null end as city,
       CASE WHEN d1.code=2 then d1.location else null end as dist,
       CASE WHEN d1.code=3 then d1.location else null end as state
FROM   df1 d1 join df2 d2 on d1.id=d2.id 




select d2.id,
       d2.date
       CASE WHEN d1.code=1 then state=d1.location,
       CASE WHEN d1.code=2 then dist=d1.location,
       CASE WHEN d1.code=3 then CityName=d1.location
FROM   df1 d1 join df2 d2 on d1.id=d2.id

Any suggestions?

Note: Looking for a SQL Query(considering two input tables)/Pyspark dataframes/Pandas dataframes


Upvotes: 1

Views: 79

Answers (3)

Steven

Reputation: 15258

I do not really understand why you want to join with df2. All your data is already in df1; this is just an aggregation.


Using SQL

select df2.id, t.city, t.dist, t.state, t.count, t.tot_sum, t.date
from df2 inner join (
    select id,
           first(case when code = 1 then location end, true) as city,
           first(case when code = 2 then location end, true) as dist,
           first(case when code = 3 then location end, true) as state,
           count(case when code = 4 then 1 end) as count,
           sum(val) as tot_sum,
           date
    from df1
    group by id, date
) t on t.id = df2.id

Using Pyspark

from pyspark.sql import functions as F

df1 = df1.groupBy("id", "date").agg(
    F.first(F.when(F.col("code") == 1, F.col("location")), ignorenulls=True).alias("city"),
    F.first(F.when(F.col("code") == 2, F.col("location")), ignorenulls=True).alias("dist"),
    F.first(F.when(F.col("code") == 3, F.col("location")), ignorenulls=True).alias("state"),
    F.count(F.when(F.col("code") == 4, F.col("code"))).alias("count"),
    F.sum(F.col("val")).alias("tot_sum"),
)

df3 = df1.join(df2.select("id"), on="id")  # select only "id" from df2 to avoid duplicate column names
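Since the question also asks for a pandas option, the same aggregation can be sketched in pandas (a sketch, assuming df1 is a pandas DataFrame with the columns shown in the question; `pick` is a hypothetical helper, not part of pandas):

```python
import pandas as pd

# Sample data matching the question
df1 = pd.DataFrame({
    "id":       [1000, 1000, 1000, 1000, 2000, 2000, 2000, 2000, 2000],
    "code":     [1, 2, 3, 4, 1, 2, 3, 4, 4],
    "location": ["AABB", "BBCC", "CCDD", "DDEE", "EEFF", "XXYY", "YYZZ", "ZZAA", "IIII"],
    "val":      [1, 3, 4, 1, 5, 4, 2, 1, 1],
    "date":     ["2021-01-01"] * 4 + ["2021-03-01"] * 5,
})

def pick(g, c):
    """Return the first location within the group whose code equals c, else None."""
    s = g.loc[g["code"] == c, "location"]
    return s.iloc[0] if len(s) else None

df3 = (
    df1.groupby(["id", "date"])
       .apply(lambda g: pd.Series({
           "city":    pick(g, 1),
           "dist":    pick(g, 2),
           "state":   pick(g, 3),
           "count":   (g["code"] == 4).sum(),  # number of code-4 records per id
           "tot_sum": g["val"].sum(),          # sum of all vals per id
       }))
       .reset_index()
)
print(df3)
```

As in the PySpark version, if you still need df2's rows you can merge the result back on `id`.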

Upvotes: 3

Gordon Linoff

Reputation: 1269773

I think you just want aggregation:

select d2.id as id,
       d2.date as date,
       max(CASE WHEN d1.code=1 then d1.location end) as city,
       max(CASE WHEN d1.code=2 then d1.location end) as dist,
       max(CASE WHEN d1.code=3 then d1.location end) as state,
       count(CASE WHEN d1.code=4 then 1 end) as count,
       sum(d1.val) as tot_sum
from df1 d1 join
     df2 d2
     on d1.id=d2.id
group by d2.id, d2.date;

Upvotes: 2

Michael Grogan

Reputation: 1026

From what I understand of the question, you wish to join df1 and df2 together.

Depending on which column you wish to use for this purpose (I am assuming ID), you can implement it as follows in SQL:

select t1.id, t1.code, t1.name,
       t1.location, t1.val, t1.date,
       t2.city, t2.dist, t2.state,
       t2.count, t2.tot_sum
from df1 as t1
inner join df2 as t2 on t1.id = t2.id;
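For completeness, the same inner join can be sketched in pandas with `merge` (a sketch, assuming both inputs are pandas DataFrames; only a few columns are stubbed in):

```python
import pandas as pd

# Minimal stand-ins for the two tables (columns abbreviated)
df1 = pd.DataFrame({"id": [1000, 2000], "code": [1, 1], "location": ["AABB", "EEFF"]})
df2 = pd.DataFrame({"id": [1000, 2000], "city": [None, None], "tot_sum": [None, None]})

# Inner join on id, mirroring the SQL query above
df3 = df1.merge(df2, on="id", how="inner")
print(df3)
```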

Upvotes: 1
