Arnold Klein
Arnold Klein

Reputation: 3086

Merging columns and removing duplicates with Pandas

I need to merge similar columns and remove duplicates (entries with the same date). The data frame:

     Albumin  C-reactive protein    CRP  Ferritin  Haemoglobin     Hb  Iron  Nancy Index  Plasma  Platelets  Transferrin saturation %  Transferrin saturations  UCEIS (0 to 8)    WCC  White Cell Count           test_date
0        NaN                 NaN    NaN       NaN          NaN    NaN   NaN          NaN     NaN        NaN                       NaN                      NaN             NaN    NaN             12.35 2016-04-17 23:00:00              
1        NaN                 NaN    NaN       NaN        133.0    NaN   NaN          NaN     NaN        NaN                       NaN                      NaN             NaN    NaN               NaN 2016-04-17 23:00:00              
2        NaN                 NaN    NaN       NaN          NaN    NaN   NaN          NaN     NaN      406.0                       NaN                      NaN             NaN    NaN               NaN 2016-04-17 23:00:00              
3        NaN                 NaN    NaN       NaN          NaN    NaN   NaN          NaN     NaN      406.0                       NaN                      NaN             NaN    NaN               NaN 2016-04-17 23:00:00              
4        NaN                32.2    NaN       NaN          NaN    NaN   NaN          NaN     NaN        NaN                       NaN                      NaN             NaN    NaN               NaN 2016-04-17 23:00:00             
5       36.0                 NaN    NaN       NaN          NaN    NaN   NaN          NaN     NaN        NaN                       NaN                      NaN             NaN    NaN               NaN 2016-04-17 23:00:00              
6        NaN                 NaN    NaN      99.7          NaN    NaN   NaN          NaN     NaN        NaN                       NaN                      NaN             NaN    NaN               NaN 2016-04-17 23:00:00              
7        NaN                 NaN    NaN       NaN          NaN    NaN   NaN          NaN     NaN        NaN                      25.0                      NaN             NaN    NaN               NaN 2016-04-17 23:00:00              
12      36.0                 NaN   32.2      99.7          NaN  133.0   NaN          NaN     NaN      406.0                       NaN                     25.0             NaN  12.35               NaN 2016-04-17 23:00:00              
14       NaN                 NaN    NaN       NaN          NaN    NaN   NaN          NaN     NaN        NaN                       NaN                      NaN             7.0    NaN               NaN 2016-04-25 23:00:00              
79      34.0                 NaN    5.4      55.9          NaN  133.0   NaN          NaN     NaN      372.0                       NaN                     28.0             NaN   7.99               NaN 2016-06-12 23:00:00              

I need to get:

     Albumin    CRP  Ferritin     Hb  Nancy Index  Plasma  Platelets  Transferrin saturations  UCEIS (0 to 8)    WCC           test_date
12      36.0   32.2      99.7  133.0          NaN     NaN      406.0                     25.0             NaN  12.35 2016-04-17 23:00:00
14       NaN    NaN       NaN    NaN          NaN     NaN        NaN                      NaN             7.0    NaN 2016-04-25 23:00:00
79      34.0    5.4      55.9  133.0          NaN     NaN      372.0                     28.0             NaN   7.99 2016-06-12 23:00:00

So, columns 'C-reactive protein' should be merged with 'CRP', 'Hemoglobin' with 'Hb', 'Transferrin saturation %' with 'Transferrin saturation'.

I can easily remove duplicates with .drop_duplicates(), but the trick is remove not only row with the same date, but also to make sure, that the values in the same column are duplicated. For example, 'C-reactive protein' at row '4' has the same values as 'CRP' in row '12', in addition, they both have the same entry date. Given all that, I need to have only 'CRP' column with values 32.2 and the date '2016-04-17' (plus other unique columns).

EDIT

Some entries are really duplicates (absolutely identical, due to system glitches), for example (last three rows, on 2016-06-20, indices '803' and '122'). Is the solution below capable of removing such identical rows?

P.S. Thanks for the amazing and general solution for duplicate, but not identical entries.

     Albumin  C-reactive protein  CRP  Ferritin  Haemoglobin     Hb  Iron  Nancy Index  Plasma  Platelets  Transferrin saturation %  Transferrin saturations  UCEIS (0 to 8)   WCC  White Cell Count          setName           test_date
735     39.0                 NaN  0.4      52.0          NaN  144.0   NaN          NaN     NaN      197.0                       NaN                     25.0             NaN  4.88               NaN           Bloods 2016-05-31 23:00:00
803     40.0                 NaN  0.2      81.0          NaN  147.0   NaN          NaN     NaN      234.0                       NaN                     35.0             NaN  8.47               NaN           Bloods 2016-06-20 23:00:00
347      NaN                 NaN  NaN       NaN          NaN    NaN   NaN          NaN     1.0        NaN                       NaN                      NaN             NaN   NaN               NaN  Research Bloods 2016-06-20 23:00:00
122     40.0                 NaN  0.2      81.9          NaN  147.0   NaN          NaN     NaN      234.0                       NaN                     35.0             NaN  8.47               NaN           Bloods 2016-06-20 23:00:00

Upvotes: 1

Views: 3786

Answers (2)

Stael
Stael

Reputation: 2689

What @jezrael was saying is that if you had a situation where:

     Albumin  C-reactive protein    CRP  test_date
0        NaN                 NaN    32       2016-04-17 23:00:00              
1        NaN                 8.0    NaN      2016-04-17 23:00:00

then his method would erase the 8.0 reading and keep only the 32 (this is because he does it in two steps (or 3?), in this line: df = df.groupby('test_date').max().rename(columns=d).groupby(axis=1, level=0).max()

df = df.groupby('test_date').max() # selects max of each column 
                                   # while collapsing 'test_date'

which for my truncated example would give:

         Albumin  C-reactive protein    CRP  test_date
0        NaN                 8.0    32       2016-04-17 23:00:00

then rename .rename(columns=d) giving:

         Albumin  CRP    CRP  test_date
0        NaN      8.0    32   2016-04-17 23:00:00

then .groupby(axis=1, level=0).max() to group along rows (instead of down columns) which gives:

         Albumin  CRP  test_date
0        NaN      32   2016-04-17 23:00:00

which is where you run the highest risk of losing data.

Alternative

I would split the original data into two frames first

df1 = df[["C-reactive protein","Haemoglobin", ...]]
df2 = df[["CRP", "Hb"]]

# then rename

df2 = df2.rename(columns={"CRP":"C-reactive protein", "Hb":"Haemoglobin", ...})

# use concat to stack them on one another

df3 = pd.concat([df1, df2]) # i've run out of names

df3 = df3.drop_duplicates() # perhaps also drop NAs?

but this is only necessary if you have multiple non-duplicate entries for the same test on the same day.

Upvotes: 1

jezrael
jezrael

Reputation: 862441

I think you need groupby with rename columns by dict:

d = {'C-reactive protein':'CRP', 'Hemoglobin':'Hb', 
     'Transferrin saturation %':'Transferrin saturations'}
df = df.groupby('test_date').max().rename(columns=d).groupby(axis=1, level=0).max()
print (df)
                     Albumin   CRP  Ferritin  Haemoglobin     Hb  Iron  \
test_date                                                                
2016-04-17 23:00:00     36.0  32.2      99.7        133.0  133.0   NaN   
2016-04-25 23:00:00      NaN   NaN       NaN          NaN    NaN   NaN   
2016-06-12 23:00:00     34.0   5.4      55.9          NaN  133.0   NaN   

                     Nancy Index  Plasma  Platelets  Transferrin saturations  \
test_date                                                                      
2016-04-17 23:00:00          NaN     NaN      406.0                     25.0   
2016-04-25 23:00:00          NaN     NaN        NaN                      NaN   
2016-06-12 23:00:00          NaN     NaN      372.0                     28.0   

                     UCEIS (0 to 8)    WCC  White Cell Count  
test_date                                                     
2016-04-17 23:00:00             NaN  12.35             12.35  
2016-04-25 23:00:00             7.0    NaN               NaN  
2016-06-12 23:00:00             NaN   7.99               NaN  

More general solution is reshape by melt, remove duplicates and then create DataFrame back:

d = {'C-reactive protein':'CRP', 'Hemoglobin':'Hb', 
'Transferrin saturation %':'Transferrin saturations'}
df = df.rename(columns=d).groupby(axis=1, level=0).max()

df = pd.melt(df, id_vars='test_date').dropna(subset=['value']).drop_duplicates()

df = df.groupby(['test_date','variable'])['value'] \
       .apply(lambda x: pd.Series(x.values)) \
       .unstack(1) \
       .reset_index(level=1, drop=True) \
       .reset_index() \
       .rename_axis(None,axis=1)
print (df)

             test_date  Albumin   CRP  Ferritin      Hb  Platelets  \
0  2016-04-17 23:00:00   1000.0  32.2      99.7  1000.0      406.0   
1  2016-04-17 23:00:00     36.0   NaN       NaN   133.0        NaN   
2  2016-04-25 23:00:00      NaN   NaN       NaN     NaN        NaN   
3  2016-06-12 23:00:00     34.0   5.4      55.9   133.0      372.0   

   Transferrin saturations  UCEIS (0 to 8)    WCC  White Cell Count  
0                     25.0             NaN  12.35             12.35  
1                      NaN             NaN    NaN               NaN  
2                      NaN             7.0    NaN               NaN  
3                     28.0             NaN   7.99               NaN  

Upvotes: 2

Related Questions