Reputation: 565
I want to make a df that in the end of the df, there are columns that gets their data from revious dfs. I want to check in other dfs, if the ids in the first columns exists there, if so, check in which columns they appear, and add it to the last column in the main df.
EXAMPLE:
I have a MAIN DF that looks like so:
names col1 col2 col3 total
bbb V V X 2
ccc V X X 1
zzz X V V 2
qqq X V X 1
rrr X X V 1
and for example I have two more dfs (in general there are more then just two more dfs, so I want to run on all of them in a loop or something), DF1:
names col1 col4 col5 total
bbb V V X 2
ccc V X X 1
yyy V V X 2
and DF2:
names col6 col2 col7 total
bbb V V X 2
ccc X X V 1
zzz X V V 2
so I want to update my MAIN DF to look like so:
names col1 col2 col3 total total_col1 total_col2
bbb V V X 2 DF1 DF1
DF2 DF2
ccc V X X 1 DF1
zzz X V V 2 DF2
qqq X V X 1
rrr X X V 1
I hope there is possibility to do so in pandas and that the example was clear
EDIT: notice to the columns: in DF1
and DF2
there are also other columns that aren't in the original main DF, so I added only the columns that are also in the original main DF.
Upvotes: 1
Views: 154
Reputation: 863791
You can use more general another answer.
First create list
dfs
of DataFrames
and in list comprehension process them. Then concat
them together and again use join
:
df_names = ['DF1', 'DF2']
cols = ['col1','col2','col3']
dfs = [DF1, DF2]
dfs = [x.set_index('names')[cols]
.replace({'V':df_names[i], 'X':np.nan})
.add_prefix('total_') for i, x in enumerate(dfs)]
DF_ALL = pd.concat(dfs)
print (DF_ALL)
total_col1 total_col2 total_col3
names
bbb DF1 DF1 NaN
ccc DF1 NaN NaN
yyy DF1 DF1 NaN
bbb DF2 DF2 NaN
ccc NaN NaN DF2
zzz NaN DF2 DF2
df = df.join(DF_ALL, on='names')
print (df)
names col1 col2 col3 total total_col1 total_col2 total_col3
0 bbb V V X 2 DF1 DF1 NaN
0 bbb V V X 2 DF2 DF2 NaN
1 ccc V X X 1 DF1 NaN NaN
1 ccc V X X 1 NaN NaN DF2
2 zzz X V V 2 NaN DF2 DF2
3 qqq X V X 1 NaN NaN NaN
4 rrr X X V 1 NaN NaN NaN
EDIT:
Solution with no names
columns:
df_names = ['DF1', 'DF2']
cols = ['col1','col2','col3']
dfs = [DF1, DF2]
dfs = [x[cols].replace({'V':df_names[i], 'X':np.nan})
.add_prefix('total_') for i, x in enumerate(dfs)]
DF_ALL = pd.concat(dfs).groupby(level=0).agg(lambda x: ', '.join(x.dropna().tolist()))
print (DF_ALL)
total_col1 total_col2 total_col3
bbb DF1, DF2 DF1, DF2
ccc DF1 DF2
yyy DF1 DF1
zzz DF2 DF2
df = pd.merge(df, DF_ALL, left_index=True, right_index=True, how='left')
df[DF_ALL.columns] = df[DF_ALL.columns].fillna('')
print (df)
col1 col2 col3 total total_col1 total_col2 total_col3
bbb V V X 2 DF1, DF2 DF1, DF2
ccc V X X 1 DF1 DF2
zzz X V V 2 DF2 DF2
qqq X V X 1
rrr X X V 1
EDIT1:
Solution with exclude cols - use drop
with list
and errors='ignore'
for no error if column is missing:
dfs = [DF1, DF2]
df_names = ['DF1', 'DF2']
exclude_cols = ['total','col_aaa']
dfs = [x.drop(exclude_cols, axis=1, errors='ignore')
.replace({'V':df_names[i], 'X':np.nan})
.add_prefix('total_') for i, x in enumerate(dfs)]
DF_ALL = pd.concat(dfs).groupby(level=0).agg(lambda x: ', '.join(x.dropna().tolist()))
print (DF_ALL)
total_col1 total_col2 total_col3
bbb DF1, DF2 DF1, DF2
ccc DF1 DF2
yyy DF1 DF1
zzz DF2 DF2
df = pd.merge(df, DF_ALL, left_index=True, right_index=True, how='left')
df[DF_ALL.columns] = df[DF_ALL.columns].fillna('')
print (df)
col1 col2 col3 total total_col1 total_col2 total_col3
bbb V V X 2 DF1, DF2 DF1, DF2
ccc V X X 1 DF1 DF2
zzz X V V 2 DF2 DF2
qqq X V X 1
rrr X X V 1
EDIT2: Added filtering of columns by intersection
:
print (DF2.columns.intersection(df.columns))
Index(['col2', 'total'], dtype='object')
dfs = [DF1, DF2]
df_names = ['DF1', 'DF2']
exclude_cols = ['total','col_aaa']
dfs = [x[x.columns.intersection(df.columns)].drop(exclude_cols, axis=1, errors='ignore')
.replace({'V':df_names[i], 'X':np.nan})
.add_prefix('total_') for i, x in enumerate(dfs)]
DF_ALL = pd.concat(dfs).groupby(level=0).agg(lambda x: ', '.join(x.dropna().tolist()))
print (DF_ALL)
total_col1 total_col2
bbb DF1 DF2
ccc DF1
yyy DF1
zzz DF2
df = pd.merge(df, DF_ALL, left_index=True, right_index=True, how='left')
df[DF_ALL.columns] = df[DF_ALL.columns].fillna('')
print (df)
col1 col2 col3 total total_col1 total_col2
bbb V V X 2 DF1 DF2
ccc V X X 1 DF1
zzz X V V 2 DF2
qqq X V X 1
rrr X X V 1
Upvotes: 1
Reputation: 324
It can work for you :
import pandas as pd
import numpy as np
MAIN_DF = [["bbb","V","V","X",2],
["ccc","V","X","X",1],
["zzz","X","V","V",2],
["qqq","X","V","X",1],
["rrr","X","X","V",1]]
MAIN_DF = pd.DataFrame(MAIN_DF, columns=["names", "col1","col2","col3","total"])
DF1 = [["bbb","V","V","X"],
["ccc","V","X","X"],
["yyy","V","V","X"]]
DF1 = pd.DataFrame(DF1, columns=["names", "col1","col2","col3"])
DF2 = [["bbb","V","V","X"],
["ccc","X","X","V"],
["zzz","X","V","V"]]
DF2 = pd.DataFrame(DF2, columns=["names", "col1","col2","col3"])
total_col = pd.DataFrame(data = np.zeros((MAIN_DF.shape[0],MAIN_DF.shape[1]-1)), columns=["names", "col1","col2","col3"])
total_col["names"]=MAIN_DF["names"]
for i in xrange(total_col.shape[0]):
name = total_col["names"][i]
for j in xrange(DF1.shape[0]):
if DF1["names"][j] == name:
for col in DF1.columns[1:]:
if DF1[col][j] == "V":
total_col[col][i] = "DF1"
for i in xrange(total_col.shape[0]):
name = total_col["names"][i]
for j in xrange(DF2.shape[0]):
if DF2["names"][j] == name:
for col in DF2.columns[1:]:
if DF2[col][j] == "V":
if total_col[col][i] == "DF1":
total_col[col][i] = "DF1 DF2"
else:
total_col[col][i] = "DF2"
names col1 col2 col3
0 bbb DF1 DF2 DF1 DF2 0
1 ccc DF1 0 DF2
2 zzz 0 DF2 DF2
3 qqq 0 0 0
4 rrr 0 0 0
Upvotes: 0