Reputation: 3
I have two PySpark DataFrames:
df1
TransID  Date   custusername
1        11/01  1A
2        11/01  1A
3        11/02  1A
4        11/02  1A
5        11/03  1A
df2
custusername  Date   CustID
1A            11/01  xx1
1A            11/02  xx1
1A            11/03  xx2
Desired output after joining the two DataFrames and counting:
Date   CustID  Count
11/01  xx1     2
11/02  xx1     2
11/03  xx2     1
Actual output I get is:
Date   CustID  Count
11/01  xx1     2
11/01  xx2     2
11/02  xx1     2
11/02  xx2     2
11/03  xx1     1
11/03  xx2     1
Because the CustID was updated on 11/03, my counts are being duplicated.
My code:
join = [df1.custusername == df2.custusername]
joined = df1.join(df2, join, "inner")
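For reference, joining only on custusername pairs each of my 5 transactions with all 3 rows in df2, so the rows multiply before any counting (a quick check, assuming the join above has been run):
# Illustrative check, not part of my original code: the inner join matches on
# custusername alone, so every transaction pairs with every df2 row for that user.
print(joined.count())  # 15 rows, instead of the 5 transactions in df1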
Upvotes: 0
Views: 44
Reputation: 541
Given two DataFrames:
df1 = spark.createDataFrame([
    (1, "11/01", "1A"),
    (2, "11/01", "1A"),
    (3, "11/02", "1A"),
    (4, "11/02", "1A"),
    (5, "11/03", "1A"),
], schema=['TransId', 'Date', 'custusername'])
df1.show()
+-------+-----+------------+
|TransId| Date|custusername|
+-------+-----+------------+
| 1|11/01| 1A|
| 2|11/01| 1A|
| 3|11/02| 1A|
| 4|11/02| 1A|
| 5|11/03| 1A|
+-------+-----+------------+
df2 = spark.createDataFrame([
    ("1A", "11/01", "xx1"),
    ("1A", "11/02", "xx1"),
    ("1A", "11/03", "xx2"),
], schema=['custusername', 'Date', 'CustId'])
df2.show()
+------------+-----+------+
|custusername| Date|CustId|
+------------+-----+------+
| 1A|11/01| xx1|
| 1A|11/02| xx1|
| 1A|11/03| xx2|
+------------+-----+------+
I'll group the first DataFrame by Date and custusername:
df1_group = df1.groupBy('Date', 'custusername').count()
df1_group.show()
+-----+------------+-----+
| Date|custusername|count|
+-----+------------+-----+
|11/01| 1A| 2|
|11/03| 1A| 1|
|11/02| 1A| 2|
+-----+------------+-----+
And then simply join df2 on both custusername and Date, so each group picks up only the CustId that was valid on that date:
df = df1_group.join(df2, on=['custusername', 'Date'], how='left')
df.show()
+------------+-----+-----+------+
|custusername| Date|count|CustId|
+------------+-----+-----+------+
| 1A|11/01| 2| xx1|
| 1A|11/03| 1| xx2|
| 1A|11/02| 2| xx1|
+------------+-----+-----+------+
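If you also want the exact column names and order from the desired output (Date, CustID, Count), a small optional follow-up could be:
from pyspark.sql import functions as F

# Rename count -> Count and CustId -> CustID, drop custusername,
# and sort by Date to match the layout in the question.
result = df.select(
    F.col('Date'),
    F.col('CustId').alias('CustID'),
    F.col('count').alias('Count'),
).orderBy('Date')
result.show()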
Upvotes: 0