Hrvoje
Hrvoje

Reputation: 15182

Python pandas, merge multiple columns in one table with single column in another table

Suppose I have table like this:

+----------+------------+----------+------------+----------+------------+-------+
| a_name_0 | id_qname_0 | a_name_1 | id_qname_1 | a_name_2 | id_qname_2 | count |
+----------+------------+----------+------------+----------+------------+-------+
| country  | 1          | NAN      | NAN        | NAN      | NAN        | 100   |
+----------+------------+----------+------------+----------+------------+-------+
| region   | 2          | city     | NAN        | NAN      | NAN        | 20    |
+----------+------------+----------+------------+----------+------------+-------+
| region   | 2          | city     | NAN        | NAN      | NAN        | 80    |
+----------+------------+----------+------------+----------+------------+-------+
| region   | 3          | age      | 4          | sex      | 6          | 40    |
+----------+------------+----------+------------+----------+------------+-------+
| region   | 3          | age      | 5          | sex      | 7          | 60    |
+----------+------------+----------+------------+----------+------------+-------+

and I want to LEFT JOIN it with following table on a_name column in panadas:

+----+---------+-------+-------+-------+
| id | a_name  | c01   | c02   | c03   |
+----+---------+-------+-------+-------+
| 1  | country | dtr1  | dtr2  | dtr3  |
+----+---------+-------+-------+-------+
| 2  | region  | dtc1  | dtc2  | dtc3  |
+----+---------+-------+-------+-------+
| 3  | city    | dta1  | dta2  | dta3  |
+----+---------+-------+-------+-------+
| 4  | age     | dtCo1 | dtCo2 | dtCo3 |
+----+---------+-------+-------+-------+
| 5  | sex     | dts1  | dts2  | dts3  |
+----+---------+-------+-------+-------+

I want to add columns c01, c02 and c03 to each value (country ,region, city, age,sex) appearing in columns a_name_0, a_name_1 and a_name_2 in first table.

Obviously I would need to add three new columns for each values that appear in a_name_0, a_name_1 and a_name_2 columns, otherwise my table would have different number of rows. Rest of row values should be empty, or NA or NAN..whatever.

Expected output:

+----------+-------+-------+-------+------------+----------+-------+-------+-------+------------+----------+-------+-------+-------+------------+-------+
| a_name_0 | c01_0 | c01_0 | c01_0 | id_qname_0 | a_name_1 | c01_1 | c01_1 | c01_1 | id_qname_1 | a_name_2 | c01_2 | c01_2 | c01_2 | id_qname_2 | count |
+----------+-------+-------+-------+------------+----------+-------+-------+-------+------------+----------+-------+-------+-------+------------+-------+
| country  | dtCo1 | dtCo2 | dtCo3 | 1          | NAN      | NAN   | NAN   | NAN   | NAN        | NAN      | NAN   | NAN   | NAN   | NAN        | 70    |
+----------+-------+-------+-------+------------+----------+-------+-------+-------+------------+----------+-------+-------+-------+------------+-------+
| region   | dtr1  | dtr2  | dtr2  | 2          | city     | dtc1  | dtc2  | dtc3  | NAN        | NAN      | NAN   | NAN   | NAN   | NAN        | 20    |
+----------+-------+-------+-------+------------+----------+-------+-------+-------+------------+----------+-------+-------+-------+------------+-------+
| region   |       |       |       | 2          | city     |       |       |       | NAN        | NAN      |       |       |       | NAN        | 20    |
+----------+-------+-------+-------+------------+----------+-------+-------+-------+------------+----------+-------+-------+-------+------------+-------+
| region   |       |       |       | 3          | age      |       |       |       | 4          | sex      |       |       |       | 6          | 40    |
+----------+-------+-------+-------+------------+----------+-------+-------+-------+------------+----------+-------+-------+-------+------------+-------+
| region   |       |       |       | 3          | age      |       |       |       | 5          | sex      |       |       |       | 7          | 60    |
+----------+-------+-------+-------+------------+----------+-------+-------+-------+------------+----------+-------+-------+-------+------------+-------+

Explanation:

I'm building data warhouse table which will serve for data analytics purposes. Quotes table (1st table) should be populated with various projects quotes info (table 2) which needs to be visually represented.

Upvotes: 1

Views: 1694

Answers (2)

jezrael
jezrael

Reputation: 862751

Use:

#convert count column to index for possible processing all another cols by groups
df1 = df1.set_index('count')
#groups by last value after last _
c = df1.columns.str.rsplit('_').str[-1]
#removed unnecessary id column from df2
df2 = df2.drop('id', axis=1)

#for list of DataFrames
dfs = []
#iterate groups
for i, x in df1.groupby(c, axis=1):
    #change columns names for match and for avoid duplicated columns names
    df2.columns = [ f'a_name_{i}'] + (df2.columns + f'_{i}').tolist()[1:]
    #left join
    x = x.merge(df2, on=f'a_name_{i}', how='left')
    #convert duplicates by a_name columns to NaNs
    m = x.duplicated(subset=[x.columns[0]])
    x.iloc[m.to_numpy(), 2:] = np.nan
    #convert id_qname columns to end
    x[f'id_qname_{i}'] = x.pop(f'id_qname_{i}')
    #append to list
    dfs.append(x)

#join together and last add count column from index 
df = pd.concat(dfs, axis=1).assign(count=df1.index)

print (df)
  a_name_0 c01_0 c02_0 c03_0  id_qname_0 a_name_1 c01_0_1 c02_0_1 c03_0_1  \
0  country  dtr1  dtr2  dtr3           1      NaN     NaN     NaN     NaN   
1   region  dtc1  dtc2  dtc3           2     city    dta1    dta2    dta3   
2   region   NaN   NaN   NaN           2     city     NaN     NaN     NaN   
3   region   NaN   NaN   NaN           3      age   dtCo1   dtCo2   dtCo3   
4   region   NaN   NaN   NaN           3      age     NaN     NaN     NaN   

   id_qname_1 a_name_2 c01_0_1_2 c02_0_1_2 c03_0_1_2  id_qname_2  count  
0         NaN      NaN       NaN       NaN       NaN         NaN    100  
1         NaN      NaN       NaN       NaN       NaN         NaN     20  
2         NaN      NaN       NaN       NaN       NaN         NaN     80  
3         4.0      sex      dts1      dts2      dts3         6.0     40  
4         5.0      sex       NaN       NaN       NaN         7.0     60  

Upvotes: 1

skillsmuggler
skillsmuggler

Reputation: 1902

Merge the dataframes using Outer joins and specify the columns (from each dataframe) on which you want join the tables.

# Sample data
>>> A
  name_1 name_2  values
0      a      b       1
1      b      c       2
2      c      b       3
3      d      a       4

>>> B
  name  values
0    a       1
1    b       2
2    c       3

>>> C
  name  values
0    a      10
1    b      20
2    c      30

Making use of the merge() method, you can specify the columns you wan to merge the dataframes on. Setting the how parameter to outer specifies an outer join, this fills the no matched data points with NaN.

# Merging
>>> merge1 = A.merge(B, left_on='name_1', right_on='name', how='outer')
>>> merge1
  name_1 name_2  values_x name  values_y
0      a      b         1    a       1.0
1      b      c         2    b       2.0
2      c      b         3    c       3.0
3      d      a         4  NaN       NaN

>>> merge = merge1.merge(C, left_on='name_2', right_on='name', how='outer')
>>> merge
  name_1 name_2  values_x name_x  values_y name_y  values
0      a      b         1      a       1.0      b      20
1      c      b         3      c       3.0      b      20
2      b      c         2      b       2.0      c      30
3      d      a         4    NaN       NaN      a      10

Upvotes: 1

Related Questions