Reputation: 15182
Suppose I have table like this:
+----------+------------+----------+------------+----------+------------+-------+
| a_name_0 | id_qname_0 | a_name_1 | id_qname_1 | a_name_2 | id_qname_2 | count |
+----------+------------+----------+------------+----------+------------+-------+
| country | 1 | NAN | NAN | NAN | NAN | 100 |
+----------+------------+----------+------------+----------+------------+-------+
| region | 2 | city | NAN | NAN | NAN | 20 |
+----------+------------+----------+------------+----------+------------+-------+
| region | 2 | city | NAN | NAN | NAN | 80 |
+----------+------------+----------+------------+----------+------------+-------+
| region | 3 | age | 4 | sex | 6 | 40 |
+----------+------------+----------+------------+----------+------------+-------+
| region | 3 | age | 5 | sex | 7 | 60 |
+----------+------------+----------+------------+----------+------------+-------+
and I want to LEFT JOIN
it with following table on a_name
column in panadas:
+----+---------+-------+-------+-------+
| id | a_name | c01 | c02 | c03 |
+----+---------+-------+-------+-------+
| 1 | country | dtr1 | dtr2 | dtr3 |
+----+---------+-------+-------+-------+
| 2 | region | dtc1 | dtc2 | dtc3 |
+----+---------+-------+-------+-------+
| 3 | city | dta1 | dta2 | dta3 |
+----+---------+-------+-------+-------+
| 4 | age | dtCo1 | dtCo2 | dtCo3 |
+----+---------+-------+-------+-------+
| 5 | sex | dts1 | dts2 | dts3 |
+----+---------+-------+-------+-------+
I want to add columns c01, c02 and c03
to each value (country ,region, city, age,sex
) appearing in columns a_name_0, a_name_1 and a_name_2
in first table.
Obviously I would need to add three new columns for each values that appear in a_name_0, a_name_1 and a_name_2
columns, otherwise my table would have different number of rows. Rest of row values should be empty, or NA or NAN..whatever.
Expected output:
+----------+-------+-------+-------+------------+----------+-------+-------+-------+------------+----------+-------+-------+-------+------------+-------+
| a_name_0 | c01_0 | c01_0 | c01_0 | id_qname_0 | a_name_1 | c01_1 | c01_1 | c01_1 | id_qname_1 | a_name_2 | c01_2 | c01_2 | c01_2 | id_qname_2 | count |
+----------+-------+-------+-------+------------+----------+-------+-------+-------+------------+----------+-------+-------+-------+------------+-------+
| country | dtCo1 | dtCo2 | dtCo3 | 1 | NAN | NAN | NAN | NAN | NAN | NAN | NAN | NAN | NAN | NAN | 70 |
+----------+-------+-------+-------+------------+----------+-------+-------+-------+------------+----------+-------+-------+-------+------------+-------+
| region | dtr1 | dtr2 | dtr2 | 2 | city | dtc1 | dtc2 | dtc3 | NAN | NAN | NAN | NAN | NAN | NAN | 20 |
+----------+-------+-------+-------+------------+----------+-------+-------+-------+------------+----------+-------+-------+-------+------------+-------+
| region | | | | 2 | city | | | | NAN | NAN | | | | NAN | 20 |
+----------+-------+-------+-------+------------+----------+-------+-------+-------+------------+----------+-------+-------+-------+------------+-------+
| region | | | | 3 | age | | | | 4 | sex | | | | 6 | 40 |
+----------+-------+-------+-------+------------+----------+-------+-------+-------+------------+----------+-------+-------+-------+------------+-------+
| region | | | | 3 | age | | | | 5 | sex | | | | 7 | 60 |
+----------+-------+-------+-------+------------+----------+-------+-------+-------+------------+----------+-------+-------+-------+------------+-------+
Explanation:
I'm building data warhouse table which will serve for data analytics purposes. Quotes table (1st table) should be populated with various projects quotes info (table 2) which needs to be visually represented.
Upvotes: 1
Views: 1694
Reputation: 862751
Use:
#convert count column to index for possible processing all another cols by groups
df1 = df1.set_index('count')
#groups by last value after last _
c = df1.columns.str.rsplit('_').str[-1]
#removed unnecessary id column from df2
df2 = df2.drop('id', axis=1)
#for list of DataFrames
dfs = []
#iterate groups
for i, x in df1.groupby(c, axis=1):
#change columns names for match and for avoid duplicated columns names
df2.columns = [ f'a_name_{i}'] + (df2.columns + f'_{i}').tolist()[1:]
#left join
x = x.merge(df2, on=f'a_name_{i}', how='left')
#convert duplicates by a_name columns to NaNs
m = x.duplicated(subset=[x.columns[0]])
x.iloc[m.to_numpy(), 2:] = np.nan
#convert id_qname columns to end
x[f'id_qname_{i}'] = x.pop(f'id_qname_{i}')
#append to list
dfs.append(x)
#join together and last add count column from index
df = pd.concat(dfs, axis=1).assign(count=df1.index)
print (df)
a_name_0 c01_0 c02_0 c03_0 id_qname_0 a_name_1 c01_0_1 c02_0_1 c03_0_1 \
0 country dtr1 dtr2 dtr3 1 NaN NaN NaN NaN
1 region dtc1 dtc2 dtc3 2 city dta1 dta2 dta3
2 region NaN NaN NaN 2 city NaN NaN NaN
3 region NaN NaN NaN 3 age dtCo1 dtCo2 dtCo3
4 region NaN NaN NaN 3 age NaN NaN NaN
id_qname_1 a_name_2 c01_0_1_2 c02_0_1_2 c03_0_1_2 id_qname_2 count
0 NaN NaN NaN NaN NaN NaN 100
1 NaN NaN NaN NaN NaN NaN 20
2 NaN NaN NaN NaN NaN NaN 80
3 4.0 sex dts1 dts2 dts3 6.0 40
4 5.0 sex NaN NaN NaN 7.0 60
Upvotes: 1
Reputation: 1902
Merge the dataframes using Outer joins and specify the columns (from each dataframe) on which you want join the tables.
# Sample data
>>> A
name_1 name_2 values
0 a b 1
1 b c 2
2 c b 3
3 d a 4
>>> B
name values
0 a 1
1 b 2
2 c 3
>>> C
name values
0 a 10
1 b 20
2 c 30
Making use of the merge()
method, you can specify the columns you wan to merge the dataframes on. Setting the how
parameter to outer
specifies an outer join, this fills the no matched data points with NaN
.
# Merging
>>> merge1 = A.merge(B, left_on='name_1', right_on='name', how='outer')
>>> merge1
name_1 name_2 values_x name values_y
0 a b 1 a 1.0
1 b c 2 b 2.0
2 c b 3 c 3.0
3 d a 4 NaN NaN
>>> merge = merge1.merge(C, left_on='name_2', right_on='name', how='outer')
>>> merge
name_1 name_2 values_x name_x values_y name_y values
0 a b 1 a 1.0 b 20
1 c b 3 c 3.0 b 20
2 b c 2 b 2.0 c 30
3 d a 4 NaN NaN a 10
Upvotes: 1