Tokyo
Tokyo

Reputation: 833

How to keep only certain df columns that appear as rows in another df

Assuming that we have two dataframes

df_1

+--------+--------+-------+-------+
| id     | col1   | col2  | col3  |
+--------+--------+-------+-------+
| A      |   10   |   5   |   4   |
| B      |   5    |   3   |   2   |
+--------+--------+-------+-------+

and df_2

+----------+--------+---------+
| col_name | col_t  | col_d   |
+----------+--------+---------+
| col1     |   3.3  |   2.2   |
| col3     |   1    |   2     |
+----------+--------+---------+

What I want to achieve is to join the two tables, such that only the columns that appear under df_2's col_name are kept in df_1 i.e. the desired table would be

+--------+--------+-------+
| id     | col1   | col3  |
+--------+--------+-------+
| A      |   10   |   4   |
| B      |   5    |   2   |
+--------+--------+-------+

however, I need to perform this action only through joins and/or df transpose or pivot if possible.

I know that the above could be easily inferred by just selecting the df_1 columns as they appear in df_2's col_name but this is not what I am looking for here

Upvotes: 0

Views: 78

Answers (1)

Cena
Cena

Reputation: 3419

One way to do this is to dedup and obtain the values in df_2.col_name using collect_list and passing this list of column names in your df_1 dataframe:

col_list = list(set(df_2.select(collect_list("col_name")).collect()[0][0]))
list_with_id = ['id'] + col_list

df_1[list_with_id].show()

Output:

+---+----+----+
| id|col1|col3|
+---+----+----+
|  A|  10|   4|
|  B|   5|   2|
+---+----+----+

Is this what you're looking for? (Assuming you want something dynamic and not manually selecting columns). I'm not using joins or pivots here though.

Upvotes: 1

Related Questions