TheDaJon
TheDaJon

Reputation: 565

pandas python add columns from other data

I want to make a df that in the end of the df, there are columns that gets their data from revious dfs. I want to check in other dfs, if the ids in the first columns exists there, if so, check in which columns they appear, and add it to the last column in the main df.

EXAMPLE:

I have a MAIN DF that looks like so:

names    col1   col2   col3   total
 bbb      V      V      X      2
 ccc      V      X      X      1
 zzz      X      V      V      2
 qqq      X      V      X      1
 rrr      X      X      V      1

and for example I have two more dfs (in general there are more then just two more dfs, so I want to run on all of them in a loop or something), DF1:

names    col1   col4   col5   total
 bbb      V      V      X      2
 ccc      V      X      X      1
 yyy      V      V      X      2

and DF2:

names    col6   col2   col7   total
 bbb      V      V      X      2
 ccc      X      X      V      1
 zzz      X      V      V      2

so I want to update my MAIN DF to look like so:

names    col1   col2   col3   total   total_col1   total_col2
 bbb      V      V      X      2         DF1           DF1           
                                         DF2           DF2
 ccc      V      X      X      1         DF1                        
 zzz      X      V      V      2                       DF2     
 qqq      X      V      X      1
 rrr      X      X      V      1

I hope there is possibility to do so in pandas and that the example was clear

EDIT: notice to the columns: in DF1 and DF2 there are also other columns that aren't in the original main DF, so I added only the columns that are also in the original main DF.

Upvotes: 1

Views: 154

Answers (2)

jezrael
jezrael

Reputation: 863791

You can use more general another answer.

First create list dfs of DataFrames and in list comprehension process them. Then concat them together and again use join:

df_names = ['DF1', 'DF2']
cols = ['col1','col2','col3']

dfs = [DF1, DF2]
dfs = [x.set_index('names')[cols]
        .replace({'V':df_names[i], 'X':np.nan})
        .add_prefix('total_') for i, x in enumerate(dfs)]
DF_ALL = pd.concat(dfs)
print (DF_ALL) 
      total_col1 total_col2 total_col3
names                                 
bbb          DF1        DF1        NaN
ccc          DF1        NaN        NaN
yyy          DF1        DF1        NaN
bbb          DF2        DF2        NaN
ccc          NaN        NaN        DF2
zzz          NaN        DF2        DF2

df = df.join(DF_ALL, on='names')
print (df)
  names col1 col2 col3  total total_col1 total_col2 total_col3
0   bbb    V    V    X      2        DF1        DF1        NaN
0   bbb    V    V    X      2        DF2        DF2        NaN
1   ccc    V    X    X      1        DF1        NaN        NaN
1   ccc    V    X    X      1        NaN        NaN        DF2
2   zzz    X    V    V      2        NaN        DF2        DF2
3   qqq    X    V    X      1        NaN        NaN        NaN
4   rrr    X    X    V      1        NaN        NaN        NaN

EDIT:

Solution with no names columns:

df_names = ['DF1', 'DF2']
cols = ['col1','col2','col3']

dfs = [DF1, DF2]
dfs = [x[cols].replace({'V':df_names[i], 'X':np.nan})
              .add_prefix('total_') for i, x in enumerate(dfs)]
DF_ALL = pd.concat(dfs).groupby(level=0).agg(lambda x: ', '.join(x.dropna().tolist()))
print (DF_ALL) 
    total_col1 total_col2 total_col3
bbb   DF1, DF2   DF1, DF2           
ccc        DF1                   DF2
yyy        DF1        DF1           
zzz                   DF2        DF2

df = pd.merge(df, DF_ALL, left_index=True, right_index=True, how='left')
df[DF_ALL.columns] = df[DF_ALL.columns].fillna('')
print (df)

    col1 col2 col3  total total_col1 total_col2 total_col3
bbb    V    V    X      2   DF1, DF2   DF1, DF2           
ccc    V    X    X      1        DF1                   DF2
zzz    X    V    V      2                   DF2        DF2
qqq    X    V    X      1                                 
rrr    X    X    V      1         

EDIT1:

Solution with exclude cols - use drop with list and errors='ignore' for no error if column is missing:

dfs = [DF1, DF2]
df_names = ['DF1', 'DF2']

exclude_cols = ['total','col_aaa']


dfs = [x.drop(exclude_cols, axis=1, errors='ignore')
        .replace({'V':df_names[i], 'X':np.nan})
        .add_prefix('total_') for i, x in enumerate(dfs)]
DF_ALL = pd.concat(dfs).groupby(level=0).agg(lambda x: ', '.join(x.dropna().tolist()))
print (DF_ALL) 
    total_col1 total_col2 total_col3
bbb   DF1, DF2   DF1, DF2           
ccc        DF1                   DF2
yyy        DF1        DF1           
zzz                   DF2        DF2

df = pd.merge(df, DF_ALL, left_index=True, right_index=True, how='left')
df[DF_ALL.columns] = df[DF_ALL.columns].fillna('')
print (df)
    col1 col2 col3  total total_col1 total_col2 total_col3
bbb    V    V    X      2   DF1, DF2   DF1, DF2           
ccc    V    X    X      1        DF1                   DF2
zzz    X    V    V      2                   DF2        DF2
qqq    X    V    X      1                                 
rrr    X    X    V      1                                 

EDIT2: Added filtering of columns by intersection:

print (DF2.columns.intersection(df.columns))
Index(['col2', 'total'], dtype='object')

dfs = [DF1, DF2]
df_names = ['DF1', 'DF2']

exclude_cols = ['total','col_aaa']

dfs = [x[x.columns.intersection(df.columns)].drop(exclude_cols, axis=1, errors='ignore')
        .replace({'V':df_names[i], 'X':np.nan})
        .add_prefix('total_') for i, x in enumerate(dfs)]
DF_ALL = pd.concat(dfs).groupby(level=0).agg(lambda x: ', '.join(x.dropna().tolist()))
print (DF_ALL) 
    total_col1 total_col2
bbb        DF1        DF2
ccc        DF1           
yyy        DF1           
zzz                   DF2

df = pd.merge(df, DF_ALL, left_index=True, right_index=True, how='left')
df[DF_ALL.columns] = df[DF_ALL.columns].fillna('')
print (df)
    col1 col2 col3  total total_col1 total_col2
bbb    V    V    X      2        DF1        DF2
ccc    V    X    X      1        DF1           
zzz    X    V    V      2                   DF2
qqq    X    V    X      1                      
rrr    X    X    V      1                                    

Upvotes: 1

Aaron Aben Danan
Aaron Aben Danan

Reputation: 324

It can work for you :

import pandas as pd
import numpy as np

MAIN_DF = [["bbb","V","V","X",2],
           ["ccc","V","X","X",1],
           ["zzz","X","V","V",2],
           ["qqq","X","V","X",1],
           ["rrr","X","X","V",1]]
MAIN_DF = pd.DataFrame(MAIN_DF, columns=["names", "col1","col2","col3","total"])

DF1 = [["bbb","V","V","X"],
       ["ccc","V","X","X"],
       ["yyy","V","V","X"]]
DF1 = pd.DataFrame(DF1, columns=["names", "col1","col2","col3"])
DF2 = [["bbb","V","V","X"],
       ["ccc","X","X","V"],
       ["zzz","X","V","V"]]
DF2 = pd.DataFrame(DF2, columns=["names", "col1","col2","col3"])


total_col = pd.DataFrame(data = np.zeros((MAIN_DF.shape[0],MAIN_DF.shape[1]-1)), columns=["names", "col1","col2","col3"])
total_col["names"]=MAIN_DF["names"] 


for i in xrange(total_col.shape[0]):
    name = total_col["names"][i]
    for j in xrange(DF1.shape[0]):
        if DF1["names"][j] == name:
            for col in DF1.columns[1:]:
                if DF1[col][j] == "V":
                    total_col[col][i] = "DF1"


for i in xrange(total_col.shape[0]):
    name = total_col["names"][i]
    for j in xrange(DF2.shape[0]):
        if DF2["names"][j] == name:
            for col in DF2.columns[1:]:
                if DF2[col][j] == "V":
                    if total_col[col][i] == "DF1":
                        total_col[col][i] = "DF1 DF2"
                    else:
                        total_col[col][i] = "DF2"

  names     col1     col2 col3
0   bbb  DF1 DF2  DF1 DF2    0
1   ccc      DF1        0  DF2
2   zzz        0      DF2  DF2
3   qqq        0        0    0
4   rrr        0        0    0

Upvotes: 0

Related Questions