Reputation: 894
I have the two dataframes below and want to populate the final dataframe by joining the two input dataframes.
df1, i.e. table1
id | code | name | location | val | date
1000 | 1 | 'A' | 'AABB' | 1 | 2021-01-01
1000 | 2 | 'B' | 'BBCC' | 3 | 2021-01-01
1000 | 3 | 'C' | 'CCDD' | 4 | 2021-01-01
1000 | 4 | 'D' | 'DDEE' | 1 | 2021-01-01
2000 | 1 | 'E' | 'EEFF' | 5 | 2021-03-01
2000 | 2 | 'F' | 'XXYY' | 4 | 2021-03-01
2000 | 3 | 'G' | 'YYZZ' | 2 | 2021-03-01
2000 | 4 | 'H' | 'ZZAA' | 1 | 2021-03-01
2000 | 4 | 'I' | 'IIII' | 1 | 2021-03-01
df1.createOrReplaceTempView('df1')
df2, i.e. table2
id | city | dist | state | count | tot_sum | date
1000 | null | null | null | null | null | 2021-01-01
2000 | null | null | null | null | null | 2021-03-01
df2.createOrReplaceTempView('df2')
df3, i.e. table3 (the desired final output)
id | city | dist | state | count | tot_sum | date
1000 | 'AABB' | 'BBCC' | 'CCDD' | 1 | 9 | 2021-01-01
2000 | 'EEFF' | 'XXYY' | 'YYZZ' | 2 | 13 | 2021-03-01
Logic:
- when code = 1, consider location as city
- when code = 2, consider location as dist
- when code = 3, consider location as state
- when code = 4, count the total number of records with that code for that id, i.e. id 1000 has only one record with code 4, while id 2000 has two
- tot_sum is the sum of all vals for that id, i.e. 1+3+4+1 = 9 for id 1000 and 5+4+2+1+1 = 13 for id 2000
I am trying something like the two queries below, however neither worked:
select d2.id as id,
d2.date as date,
CASE WHEN d1.code=1 then d1.location else null end as city,
CASE WHEN d1.code=2 then d1.location else null end as dist,
CASE WHEN d1.code=3 then d1.location else null end as state
FROM df1 d1 join df2 d2 on d1.id=d2.id

and also:
select d2.id,
d2.date
CASE WHEN d1.code=1 then state=d1.location,
CASE WHEN d1.code=2 then dist=d1.location,
CASE WHEN d1.code=3 then CityName=d1.location
FROM df1 d1 join df2 d2 on d1.id=d2.id
Any suggestions?
Note: I am looking for a solution as a SQL query (considering the two input tables), PySpark dataframes, or Pandas dataframes.
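For reference, the sample data can be rebuilt like this (a minimal PySpark sketch; it assumes a local SparkSession, and df2 gets an explicit schema because its placeholder columns are all null):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# df1: one row per (id, code) with a location and a val
df1 = spark.createDataFrame(
    [(1000, 1, 'A', 'AABB', 1, '2021-01-01'),
     (1000, 2, 'B', 'BBCC', 3, '2021-01-01'),
     (1000, 3, 'C', 'CCDD', 4, '2021-01-01'),
     (1000, 4, 'D', 'DDEE', 1, '2021-01-01'),
     (2000, 1, 'E', 'EEFF', 5, '2021-03-01'),
     (2000, 2, 'F', 'XXYY', 4, '2021-03-01'),
     (2000, 3, 'G', 'YYZZ', 2, '2021-03-01'),
     (2000, 4, 'H', 'ZZAA', 1, '2021-03-01'),
     (2000, 4, 'I', 'IIII', 1, '2021-03-01')],
    ['id', 'code', 'name', 'location', 'val', 'date'])

# df2: the empty target, one row per id
df2 = spark.createDataFrame(
    [(1000, None, None, None, None, None, '2021-01-01'),
     (2000, None, None, None, None, None, '2021-03-01')],
    'id long, city string, dist string, state string, count long, tot_sum long, date string')

df1.createOrReplaceTempView('df1')
df2.createOrReplaceTempView('df2')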
Upvotes: 1
Views: 79
Reputation: 15258
I do not really understand why you want to join with df2. All your data are already in df1; that's just an aggregation.
select * from df2 inner join (
    select id,
           -- the second argument to first() makes it skip the nulls
           -- produced by the non-matching codes
           first(case when code=1 then location end, true) as city,
           first(case when code=2 then location end, true) as dist,
           first(case when code=3 then location end, true) as state,
           count(case when code=4 then 1 end) as count,
           sum(val) as tot_sum,
           date
    from df1
    group by id, date
) t on t.id = df2.id
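The same aggregation with the DataFrame API: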
from pyspark.sql import functions as F

df1 = df1.groupBy("id", "date").agg(
    # the first non-null location among the code-1/2/3 rows becomes city/dist/state
    F.first(F.when(F.col("code") == 1, F.col("location")), ignorenulls=True).alias("city"),
    F.first(F.when(F.col("code") == 2, F.col("location")), ignorenulls=True).alias("dist"),
    F.first(F.when(F.col("code") == 3, F.col("location")), ignorenulls=True).alias("state"),
    # when() without otherwise() yields null for other codes, and count() skips nulls
    F.count(F.when(F.col("code") == 4, F.col("code"))).alias("count"),
    F.sum(F.col("val")).alias("tot_sum"),
)
df3 = df1.join(df2, on='id')
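Note that df2 still carries its empty city/dist/state/count/tot_sum columns, which collide with the aggregated names after the join; one way around that (a small tweak, not part of the original answer) is to keep only df2's key columns:

df3 = df1.join(df2.select("id", "date"), on=["id", "date"])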
Upvotes: 3
Reputation: 1269773
I think you just want aggregation:
select d2.id as id,
       d2.date as date,
       max(CASE WHEN d1.code=1 then d1.location end) as city,
       max(CASE WHEN d1.code=2 then d1.location end) as dist,
       max(CASE WHEN d1.code=3 then d1.location end) as state,
       count(CASE WHEN d1.code=4 then 1 end) as count,
       sum(d1.val) as tot_sum
from df1 d1 join
     df2 d2
     on d1.id=d2.id
group by d2.id, d2.date;
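Since the question also asks for Pandas, here is a rough equivalent of the same aggregation (a sketch that assumes df1 is a pandas DataFrame and, as in the sample data, every id has exactly one row for each of codes 1-3):

import pandas as pd

def summarize(g):
    # pick the location of the code-1/2/3 row, count the code-4 rows, sum all vals
    return pd.Series({
        'city': g.loc[g['code'] == 1, 'location'].iloc[0],
        'dist': g.loc[g['code'] == 2, 'location'].iloc[0],
        'state': g.loc[g['code'] == 3, 'location'].iloc[0],
        'count': (g['code'] == 4).sum(),
        'tot_sum': g['val'].sum(),
    })

df3 = df1.groupby(['id', 'date']).apply(summarize).reset_index()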
Upvotes: 2
Reputation: 1026
From what I understand of the question, you wish to join df1 and df2 together.
Depending on which column you wish to use for this purpose (I am assuming id), you can implement it as follows in SQL:
select t1.id, t1.code, t1.name,
       t1.location, t1.val, t1.date,
       t2.city, t2.dist, t2.state,
       t2.count, t2.tot_sum
from df1 as t1
inner join df2 as t2 on t1.id=t2.id;
Upvotes: 1