Paulo

Reputation: 69

PySpark: create an array to store three keys of a dataframe

I want to create an array that stores three fields from a dataframe, then read that array so I can remove the rows with those keys from another dataframe.

df1

id; id1; code; date_create
1; 100; 50; 2021-10-10
2; 200; 60; 2021-10-10
3; 300; 70; 2021-10-10
4; 400; 80; 2021-10-10
5; 500; 90; 2021-10-10

df2

id; id1; code; date_create
1; 100; 50; 2021-10-10
2; 200; 60; 2021-10-10
3; 300; 70; 2021-10-10
4; 400; 80; 2021-10-15
5; 500; 90; 2021-10-15
6; 600; 100; 2021-10-15
7; 700; 101; 2021-10-15

I would like to store it in an array:

Read df2 where date_create equals 2021-10-15 and save the fields id, id1 and code.

Then read the array and generate df1 again, but without the id, id1, code combinations that are in the array.

More or less like this; the code below is not right, it's more of an idea:

lista = df2.filter(df2.date_create == "2021-10-15").select("id", "id1", "code").collect()
for i in lista:
    df1 = df1.filter(~((df1.id == i.id) & (df1.id1 == i.id1) & (df1.code == i.code)))

Then I was going to make a union

df2.union(df1)

to avoid duplication problems.

If anyone can help me I would appreciate it.

result
id; id1; code; date_create
1; 100; 50; 2021-10-10
2; 200; 60; 2021-10-10
3; 300; 70; 2021-10-10
4; 400; 80; 2021-10-15
5; 500; 90; 2021-10-15
6; 600; 100; 2021-10-15
7; 700; 101; 2021-10-15

Upvotes: 0

Views: 27

Answers (1)

mck

Reputation: 42342

You can do an anti-join to eliminate duplicates, and then union:

result = df1.join(df2, ['id', 'id1', 'code'], 'anti').union(df2)
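The anti join keeps only the rows of df1 whose (id, id1, code) key has no match in df2; unioning df2 back on then yields each key exactly once. The same semantics can be sketched in plain Python with the sample rows from the question, just to illustrate what the anti join does (no Spark needed here):

```python
# Rows as (id, id1, code, date_create) tuples, copied from the question's tables.
df1 = [
    (1, 100, 50, "2021-10-10"),
    (2, 200, 60, "2021-10-10"),
    (3, 300, 70, "2021-10-10"),
    (4, 400, 80, "2021-10-10"),
    (5, 500, 90, "2021-10-10"),
]
df2 = [
    (1, 100, 50, "2021-10-10"),
    (2, 200, 60, "2021-10-10"),
    (3, 300, 70, "2021-10-10"),
    (4, 400, 80, "2021-10-15"),
    (5, 500, 90, "2021-10-15"),
    (6, 600, 100, "2021-10-15"),
    (7, 700, 101, "2021-10-15"),
]

# Anti join on (id, id1, code): keep df1 rows whose key is absent from df2 ...
keys_in_df2 = {(i, i1, c) for i, i1, c, _ in df2}
anti = [row for row in df1 if (row[0], row[1], row[2]) not in keys_in_df2]

# ... then union df2 back on. Every df1 key here also appears in df2,
# so the result is exactly the seven rows of df2, matching the expected output.
result = anti + df2
```

Because the anti join removes every df1 row that shares a key with df2, the union can never introduce duplicate keys, which is exactly the deduplication the question asks for.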

Upvotes: 1
