ashish
ashish

Reputation: 239

How to create new column dynamically in dataframe in python

df1 = pd.DataFrame(
{
    "empid" : [1,2,3,4,5,6],
    "empname" : ['a', 'b','c','d','e','f'],
    "empcity" : ['aa','bb','cc','dd','ee','ff']
})
df1

df2 = pd.DataFrame(
{
    "empid" : [1,2,3,4,5,6],
    "empname" : ['a', 'b','m','d','n','f'],
    "empcity" : ['aa','bb','cc','ddd','ee','fff']
})
df2

df_all = pd.concat([df1.set_index('empid'),df2.set_index('empid')],axis='columns',keys=['first','second'])
df_all

df_final = df_all.swaplevel(axis = 'columns')[df1.columns[1:]]
df_final

Based on df_final data frame, need to create following output. here comparison column need to created dynamically for every identical column as i'm trying to compare two data frame(both data frame structure and column name are same) where number of columns are more than 300

enter image description here

Upvotes: 0

Views: 348

Answers (3)

Akshay Sehgal
Akshay Sehgal

Reputation: 19322

Directly comparing 2 dataframes with ==

You can do this with a simple == between two dataframes that you need to compare. Let's start with the original 2 dataframes df1 and df2 -

first = df1.set_index('empid')
second = df2.set_index('empid')
comparisons = first==second      #<---

output = pd.concat([first, second, comparisons], axis=1,keys=['first','second', 'comparisons'])

#Swapping level and reindexing, borrowed from Jezrael's excellent answer
output = output.swaplevel(axis=1).reindex(first.columns, axis=1, level=0)
print(output)
      empname                    empcity                   
        first second comparisons   first second comparisons
empid                                                      
1           a      a        True      aa     aa        True
2           b      b        True      bb     bb        True
3           c      m       False      cc     cc        True
4           d      d        True      dd    ddd       False
5           e      n       False      ee     ee        True
6           f      f        True      ff    fff       False

Alternate approach with pandas groupby

In addition to the excellent answer by jezrael, I am adding an alternate way of doing this using pandas groupby.

  1. Tranpose to get columns as row indexes
  2. Groupby on first level which contains empcity and empname
  3. Apply comparison between the 2 rows
  4. Transpose back to columns
  5. Add multi index columns by product of original columns and "comparisons"
  6. Combine the two dataframes (original one and one with comparisons)
  7. Use swaplevel and reindex to get the order of columns that you need
#create comparisons
comparisons = (df_all.T
                     .groupby(level=-1)
                     .apply(lambda x: x.iloc[0]==x.iloc[1])
                     .T)

#add multi index columns
comparisons.columns = pd.MultiIndex.from_product([['comparison'],comparisons.columns])

#concatenate with original data
df_final = pd.concat([df_all, comparisons], axis='columns')

#Swapping level and reindexing, borrowed from Jezrael's excellent answer
df_final = (df_final.swaplevel(axis = 'columns')
                    .reindex(df1.set_index('empid')
                                .columns, axis=1, level=0))
print(df_final)
      empname                   empcity                  
        first second comparison   first second comparison
empid                                                    
1           a      a       True      aa     aa       True
2           b      b       True      bb     bb       True
3           c      m      False      cc     cc       True
4           d      d       True      dd    ddd      False
5           e      n      False      ee     ee       True
6           f      f       True      ff    fff      False

Upvotes: 1

jezrael
jezrael

Reputation: 862511

Use DataFrame.stack for possible compare all levels columns first with second, create new column in DataFrame.assign and reshape back by DataFrame.unstack with DataFrame.swaplevel and DataFrame.reindex for original order:

#original ordering
orig = df1.columns[1:].tolist()
print (orig)
['empname', 'empcity']

df_final = (df_all.stack()
                  .assign(comparions=lambda x: x['first'].eq(x['second']))
                  .unstack()
                  .swaplevel(axis = 'columns')
                  .reindex(orig, axis=1, level=0))
print (df_final)
      empname                   empcity                  
        first second comparions   first second comparions
empid                                                    
1           a      a       True      aa     aa       True
2           b      b       True      bb     bb       True
3           c      m      False      cc     cc       True
4           d      d       True      dd    ddd      False
5           e      n      False      ee     ee       True
6           f      f       True      ff    fff      False

Upvotes: 2

user7864386
user7864386

Reputation:

(i) Use get_level_values to get the label values for level 0

(ii) Iterate over the outcome of (i) and for each level=0, do element-wise comparison using eq between first and second

(iii) use sort_index to sort columns in desired order

for level_0 in df_final.columns.get_level_values(0).unique():
    df_final[(level_0, 'comparison')] = df_final[(level_0, 'first')].eq(df_final[(level_0,'second')])
df_final = df_final.sort_index(level=0, sort_remaining=False, axis=1)

Output:

      empcity                   empname                  
        first second comparison   first second comparison
empid                                                    
1          aa     aa       True       a      a       True
2          bb     bb       True       b      b       True
3          cc     cc       True       c      m      False
4          dd    ddd      False       d      d       True
5          ee     ee       True       e      n      False
6          ff    fff      False       f      f       True

Upvotes: 1

Related Questions