Reputation: 833
Assuming that we have two dataframes
df_1
+--------+--------+-------+-------+
| id | col1 | col2 | col3 |
+--------+--------+-------+-------+
| A | 10 | 5 | 4 |
| B | 5 | 3 | 2 |
+--------+--------+-------+-------+
and df_2
+----------+--------+---------+
| col_name | col_t | col_d |
+----------+--------+---------+
| col1 | 3.3 | 2.2 |
| col3 | 1 | 2 |
+----------+--------+---------+
What I want to achieve is to join the two tables, such that only the columns that appear under df_2
's col_name
are kept in df_1
i.e. the desired table would be
+--------+--------+-------+
| id | col1 | col3 |
+--------+--------+-------+
| A | 10 | 4 |
| B | 5 | 2 |
+--------+--------+-------+
however, I need to perform this action only through joins and/or df transpose or pivot if possible.
I know that the above could be easily inferred by just select
ing the df_1
columns as they appear in df_2
's col_name
but this is not what I am looking for here
Upvotes: 0
Views: 78
Reputation: 3419
One way to do this is to dedup and obtain the values in df_2.col_name
using collect_list
and passing this list of column names in your df_1
dataframe:
col_list = list(set(df_2.select(collect_list("col_name")).collect()[0][0]))
list_with_id = ['id'] + col_list
df_1[list_with_id].show()
Output:
+---+----+----+
| id|col1|col3|
+---+----+----+
| A| 10| 4|
| B| 5| 2|
+---+----+----+
Is this what you're looking for? (Assuming you want something dynamic and not manually selecting columns). I'm not using joins or pivots here though.
Upvotes: 1