Reputation: 3

Pyspark duplications after joining two dataframes

I have two Pyspark df's

df1

TransID  Date     custusername
1        11/01      1A
2        11/01      1A
3        11/02      1A
4        11/02      1A
5        11/03      1A

df2

custusername   Date    CustID
1A             11/01    xx1
1A             11/02    xx1
1A             11/03    xx2

Desired output after joining two dataframes and counting

Date   CustID   Count
11/01   xx1      2
11/02   xx1      2
11/03   xx2      1

Actual output i get is

11/01   xx1      2
11/01   xx2      2
11/02   xx1      2
11/02   xx2      2
11/03   xx1      1
11/03   xx2      1

Because the CustID updated on 11/03, my count is duplicating.

my code

join = [df1.custusername == df2.custusername]
joined = df1.join(df2, join, "inner")

Upvotes: 0

Answers (1)

lukaszKielar

Reputation: 541

Having two DataFrames:

df1 = spark.createDataFrame([
    (1, "11/01", "1A"),
    (2, "11/01", "1A"),
    (3, "11/02", "1A"),
    (4, "11/02", "1A"),
    (5, "11/03", "1A"),
], schema=['TransId', 'Date', 'custusername'])
df1.show()

+-------+-----+------------+
|TransId| Date|custusername|
+-------+-----+------------+
|      1|11/01|          1A|
|      2|11/01|          1A|
|      3|11/02|          1A|
|      4|11/02|          1A|
|      5|11/03|          1A|
+-------+-----+------------+

df2 = spark.createDataFrame([
    ("1A", "11/01", "xx1"),
    ("1A", "11/02", "xx1"),
    ("1A", "11/03", "xx2"),
], schema=['custusername', 'Date', 'CustId'])
df2.show()

+------------+-----+------+
|custusername| Date|CustId|
+------------+-----+------+
|          1A|11/01|   xx1|
|          1A|11/02|   xx1|
|          1A|11/03|   xx2|
+------------+-----+------+

I'll group first DataFrame by Date and custusername.

df1_group = df1.groupBy('Date', 'custusername').count()
df1_group.show()

+-----+------------+-----+
| Date|custusername|count|
+-----+------------+-----+
|11/01|          1A|    2|
|11/03|          1A|    1|
|11/02|          1A|    2|
+-----+------------+-----+

And then simply join df2

df = df1_group.join(df2, on=['custusername', 'Date'], how='left')
df.show()

+------------+-----+-----+------+
|custusername| Date|count|CustId|
+------------+-----+-----+------+
|          1A|11/01|    2|   xx1|
|          1A|11/03|    1|   xx2|
|          1A|11/02|    2|   xx1|
+------------+-----+-----+------+

Upvotes: 0

Pyspark duplications after joining two dataframes

Answers (1)

Related Questions