Move non empty cells to left in grouped columns pandas

Question

I have a dataframe where there are multiple columns with similar column names. I want the empty cells to be populated with those columns which have data to the right.

Address1     Address2     Address3     Address4     Phone1     Phone2     Phone3     Phone4
ABC          nan          def          nan          9091-XYz   nan        nan        XYZ-ABZ

Should be column shifted to something like

Address1     Address2     Address3     Address4     Phone1     Phone2     Phone3     Phone4
ABC          def          nan          nan          9091-XYz   XYZ-ABZ    nan        nan

There's another question which solves a similar problem.

pdf = pd.read_csv('Data.txt',sep='	')

# gets a set of columns removing the numerical part
columns = set(map(lambda x : x.rstrip('0123456789'),pdf.columns))

for col_pattern in columns:
    # get columns with similar names
    current = [col for col in pdf.columns if col_pattern in col]
    coldf= pdf[current]
    # shift columns to the left

The file Data.txt has columns sorted by column names so all the columns with similar names come together.

Any help with this is appreciated

I had tried adding this to the above code from the link, which ran out of memory :

    newdf=pd.read_csv(StringIO(u''+re.sub(',+',',',df.to_csv()).decode('utf-8')))
    list_.append(newdf)
pd.concat(list_,axis=0).to_csv('test.txt')

jezrael · Accepted Answer

Solutions with MultiIndex and dropna:

import pandas as pd
import numpy as np

df = pd.DataFrame({'Address1': {0: 'ABC', 1: 'ABC'}, 
                   'Address2': {0: np.nan, 1: np.nan}, 
                   'Address3': {0: 'def', 1: 'def'}, 
                   'Phone4': {0: 'XYZ-ABZ', 1: 'XYZ-ABZ'}, 
                   'Address4': {0: np.nan, 1: np.nan}, 
                   'Phone1': {0: '9091-XYz', 1: 'Z9091-XYz'}, 
                   'Phone3': {0: np.nan, 1: 'aaa'}, 
                   'Phone2': {0: np.nan, 1: np.nan}})

print (df)
  Address1  Address2 Address3  Address4     Phone1  Phone2 Phone3   Phone4
0      ABC       NaN      def       NaN   9091-XYz     NaN    NaN  XYZ-ABZ
1      ABC       NaN      def       NaN  Z9091-XYz     NaN    aaa  XYZ-ABZ

#multiindex from columns of df
cols = df.columns.str.extract('([[A-Za-z]+)(\d+)', expand=True).values.tolist()

mux = pd.MultiIndex.from_tuples(cols)
df.columns = mux
print (df)
  Address                   Phone                  
        1   2    3   4          1   2    3        4
0     ABC NaN  def NaN   9091-XYz NaN  NaN  XYZ-ABZ
1     ABC NaN  def NaN  Z9091-XYz NaN  aaa  XYZ-ABZ

#unstack, remove NaN rows, convert to df (because cumcount)
df1 = df.unstack().dropna().reset_index(level=1, drop=True).to_frame()
#create new level of index
df1['g'] = (df1.groupby(level=[0,1]).cumcount() + 1).astype(str)
#add column g to multiindex
df1.set_index('g', append=True, inplace=True)
#reshape to original
df1 = df1.unstack(level=[0,2])
#remove first level of multiindex of column (0 from to_frame)
df1.columns = df1.columns.droplevel(0)
#reindex and replace None to NaN
df1 = df1.reindex(columns=mux).replace({None: np.nan})
#'reset' multiindex in columns
df1.columns = [''.join(col) for col in df1.columns]
print (df1)
  Address1 Address2  Address3  Address4     Phone1   Phone2   Phone3  Phone4
0      ABC      def       NaN       NaN   9091-XYz  XYZ-ABZ      NaN     NaN
1      ABC      def       NaN       NaN  Z9091-XYz      aaa  XYZ-ABZ     NaN

Old solution:

I find another problem - solution above doest work correctly if more rows in DataFrame. So you can use double apply. But problem of this solution is uncorrect order of values in rows:

df = pd.DataFrame({'Address1': {0: 'ABC', 1: 'ABC'}, 'Address2': {0: np.nan, 1: np.nan}, 'Address3': {0: 'def', 1: 'def'}, 'Phone4': {0: 'XYZ-ABZ', 1: 'XYZ-ABZ'}, 'Address4': {0: np.nan, 1: np.nan}, 'Phone1': {0: '9091-XYz', 1: '9091-XYz'}, 'Phone3': {0: np.nan, 1: 'aaa'}, 'Phone2': {0: np.nan, 1: np.nan}})

print (df)
  Address1  Address2 Address3  Address4    Phone1  Phone2 Phone3   Phone4
0      ABC       NaN      def       NaN  9091-XYz     NaN    NaN  XYZ-ABZ
1      ABC       NaN      def       NaN  9091-XYz     NaN    aaa  XYZ-ABZ 

cols = df.columns.str.extract('([[A-Za-z]+)(\d+)', expand=True).values.tolist()
mux = pd.MultiIndex.from_tuples(cols)
df.columns = mux

df = df.groupby(axis=1, level=0)
       .apply(lambda x: x.apply(lambda y: y.sort_values().values, axis=1))

df.columns = [''.join(col) for col in df.columns]
print (df)
  Address1 Address2  Address3  Address4    Phone1   Phone2 Phone3  Phone4
0      ABC      def       NaN       NaN  9091-XYz  XYZ-ABZ    NaN     NaN
1      ABC      def       NaN       NaN  9091-XYz  XYZ-ABZ    aaa     NaN

Also I try modify piRSquared solution - then you does not need MultiIndex:

coltype = df.columns.str.extract(r'([[A-Za-z]+)', expand=False)
print (coltype)
Index(['Address', 'Address', 'Address', 'Address', 'Phone', 'Phone', 'Phone',
       'Phone'],
      dtype='object')

df = df.groupby(coltype, axis=1)
       .apply(lambda x: x.apply(lambda y: y.sort_values().values, axis=1))
print (df)
  Address1 Address2  Address3  Address4    Phone1   Phone2 Phone3  Phone4
0      ABC      def       NaN       NaN  9091-XYz  XYZ-ABZ    NaN     NaN
1      ABC      def       NaN       NaN  9091-XYz  XYZ-ABZ    aaa     NaN

Move non empty cells to left in grouped columns pandas

Answers (2)

Related Questions