flyingmeatball
flyingmeatball

Reputation: 7997

pandas condensing multiple columns for same value

I'm trying to integrate multiple data sources and I'm finding it a bit tricky to concisely combine my columns. I have a dataframe that looks something like this:

df = pd.DataFrame([['Address Data','City data','State Data', 'Zip Data', np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan],
     [ np.nan,np.nan,np.nan,np.nan,'Address Data','City data','State Data', 'Zip Data',np.nan,np.nan,np.nan,np.nan],
                  [ np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,'Address Data','City data','State Data', 'Zip Data']],
                 columns = ['Address1','City1','State1','Zip1','Address2','City2','State2','Zip2','Address3','City3','State3','Zip3'])

print df

       Address1      City1      State1      Zip1      Address2      City2/  
0  Address Data  City Data  State Data  Zip Data           NaN        NaN   
1           NaN        NaN         NaN       NaN  Address Data  City Data   
2           NaN        NaN         NaN       NaN           NaN        NaN   

       State2      Zip2      Address3      City3      State3      Zip3  
0         NaN       NaN           NaN        NaN         NaN       NaN  
1  State Data  Zip Data           NaN        NaN         NaN       NaN  
2         NaN       NaN  Address Data  City Data  State Data  Zip Data 

I'm trying to combine to something like this:

#run code here to combine multiple versions of the columns

print df

   Address       City       State       Zip
0  Address Data  City data  State Data  Zip Data
1  Address Data  City data  State Data  Zip Data
2  Address Data  City data  State Data  Zip Data 

I was hoping I could use lines like this to set multiple columns at once:

df.loc[df['State1'].notnull(),['Address','city','state','State','Zip']] = df.loc[df['State1'].notnull(),['Address1','City1','State1','Zip1']].values

However, .loc doesn't allow you to create multiple columns at once. Any thoughts on how to combine the data would be much appreciated!

Edit based on solution below

The problem with that solution is my data may actually look more like this:

df = pd.DataFrame([['Address Data','City data','State Data', 'Zip Data', 'Address Data','City data','State Data', 'Zip Data',np.nan,np.nan,np.nan,np.nan],
 [ np.nan,np.nan,np.nan,np.nan,'Address Data','City data','State Data', 'Zip Data',np.nan,np.nan,np.nan,np.nan],
              [ np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,'Address Data','City data','State Data', 'Zip Data']],
             columns = ['Address1','City1','State1','Zip1','Address2','City2','State2','Zip2','Address3','City3','State3','Zip3'])

In this case, the soltuion below gives me 4 rows, when I only want the original 3. What I want to say is "if section 1 has data, use section 1, else if section 2 has data use section 2, else if section 3 has data, use section 3". Each row needs to remain unique, because there are many other attributes of a row that I am not consolidating. Thanks!

Upvotes: 2

Views: 1853

Answers (3)

MaxU - stand with Ukraine
MaxU - stand with Ukraine

Reputation: 210982

what about this?

uniq_cols = df.columns.str.replace(r'\d+$', '').unique()

new = pd.DataFrame(columns=uniq_cols)

for col in uniq_cols:
    new[col] = df.filter(like=col).apply(lambda x: x.dropna()[0], axis=1)

Output:

In [282]: new
Out[282]:
        Address       City       State       Zip
0  Address Data  City data  State Data  Zip Data
1  Address Data  City data  State Data  Zip Data
2  Address Data  City data  State Data  Zip Data

Explanation:

In [283]: uniq_cols
Out[283]: array(['Address', 'City', 'State', 'Zip'], dtype=object)

In [284]: df.filter(like='Address').apply(lambda x: x.dropna()[0], axis=1)
Out[284]:
0    Address Data
1    Address Data
2    Address Data
dtype: object

Upvotes: 0

jezrael
jezrael

Reputation: 863801

I think solution is totally different, so I decide create new answer:

import pandas as pd
import numpy as np

#random dataframe
np.random.seed(1)
df1 = pd.DataFrame(np.random.randint(10, size=(3,9)))
df1.columns = ['a1','b1','c1','a2','b2','c2','a3','b3','c3']

df1.loc[[1,2],['a1','b1','c1']] = np.nan
print (df1)
    a1   b1   c1  a2  b2  c2  a3  b3  c3
0  5.0  8.0  9.0   5   0   0   1   7   6
1  NaN  NaN  NaN   5   2   4   2   4   7
2  NaN  NaN  NaN   7   0   6   9   9   7

#stack dataframe and extract string and numbers from column e
df = df1.stack().reset_index()
df.columns= ['d','e','f']
df[['g','h']] = df.e.str.extract(r'([a-zA-Z]+)([0-9]+)', expand=True)

#append 1 to d, because index starts from 1 and compare with h
df = df[df.d + 1 == df.h.astype(int)]
#remove columns h, e
df = df.drop(['h', 'e'], axis=1)
#reshaping
df = df.pivot(index='d', columns='g', values='f')
#remove index and columns names (pandas 0.18.0+)
df = df.rename_axis(None).rename_axis(None, axis=1)

print (df)
     a    b    c
0  5.0  8.0  9.0
1  5.0  2.0  4.0
2  9.0  9.0  7.0

EDIT: I try modify a bit your sample:

import pandas as pd
import numpy as np

df1 = pd.DataFrame([['Address Data1','City data1','State Data1', 'Zip Data1', 'Address Data2','City data2','State Data2', 'Zip Data2',np.nan,np.nan,np.nan,np.nan],
 [ np.nan,np.nan,np.nan,np.nan,'Address Data3','City data3','State Data3', 'Zip Data3',np.nan,np.nan,np.nan,np.nan],
              [ np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,'Address Data4','City data4','State Data4', 'Zip Data4']],
             columns = ['Address1','City1','State1','Zip1','Address2','City2','State2','Zip2','Address3','City3','State3','Zip3'])

print (df1)
        Address1       City1       State1       Zip1       Address2  \
0  Address Data1  City data1  State Data1  Zip Data1  Address Data2   
1            NaN         NaN          NaN        NaN  Address Data3   
2            NaN         NaN          NaN        NaN            NaN   

        City2       State2       Zip2       Address3       City3       State3  \
0  City data2  State Data2  Zip Data2            NaN         NaN          NaN   
1  City data3  State Data3  Zip Data3            NaN         NaN          NaN   
2         NaN          NaN        NaN  Address Data4  City data4  State Data4   

        Zip3  
0        NaN  
1        NaN  
2  Zip Data4  
#stack dataframe and extract string and numbers from column e
df = df1.stack().reset_index()
df.columns= ['d','e','f']
df[['g','h']] = df.e.str.extract(r'([a-zA-Z]+)([0-9]+)', expand=True)

#append 1 to d, because index starts from 1 and compare with h
df = df[df.d + 1 == df.h.astype(int)]
#remove columns h, e
df = df.drop(['h', 'e'], axis=1)
#reshaping
df = df.pivot(index='d', columns='g', values='f')

df = df.rename_axis(None).rename_axis(None, axis=1)
print (df)
         Address        City        State        Zip
0  Address Data1  City data1  State Data1  Zip Data1
1  Address Data3  City data3  State Data3  Zip Data3
2  Address Data4  City data4  State Data4  Zip Data4

Upvotes: 1

jezrael
jezrael

Reputation: 863801

IIUC you can first select column names by list comprehension and then use pd.lreshape:

a = ([col for col in df.columns if col.startswith('Address')])
c = ([col for col in df.columns if col.startswith('City')])
s = ([col for col in df.columns if col.startswith('State')])
z = ([col for col in df.columns if col.startswith('Zip')])

print (a)
print (c)
print (s)
print (z)
['Address1', 'Address2', 'Address3']
['City1', 'City2', 'City3']
['State1', 'State2', 'State3']
['Zip1', 'Zip2', 'Zip3']

df1 = pd.lreshape(df, {'Address':a,  'City' :c, 'State':s,  'Zip' :z}) 
print (df1)

        Address       State       City       Zip
0  Address Data  State Data  City data  Zip Data
1  Address Data  State Data  City data  Zip Data
2  Address Data  State Data  City data  Zip Data

EDIT:

If you need NOT drop rows with NaN values, add parameter dropna=False:

df1 = pd.lreshape(df, {'Address':a,  'City' :c, 'State':s,  'Zip' :z}, dropna=False) 
print (df1)
        Address       State       City       Zip
0  Address Data  State Data  City data  Zip Data
1           NaN         NaN        NaN       NaN
2           NaN         NaN        NaN       NaN
3           NaN         NaN        NaN       NaN
4  Address Data  State Data  City data  Zip Data
5           NaN         NaN        NaN       NaN
6           NaN         NaN        NaN       NaN
7           NaN         NaN        NaN       NaN
8  Address Data  State Data  City data  Zip Data

Another numpy solution, but data can be nice ordered (in sample yes, in real maybe not):

print (pd.DataFrame((df.values.reshape(9,4)), columns=['Address','City','State','Zip'])
         .dropna())

        Address       City       State       Zip
0  Address Data  City data  State Data  Zip Data
4  Address Data  City data  State Data  Zip Data
8  Address Data  City data  State Data  Zip Data

Upvotes: 1

Related Questions