Reputation: 7997
I'm trying to integrate multiple data sources and I'm finding it a bit tricky to concisely combine my columns. I have a dataframe that looks something like this:
import numpy as np
import pandas as pd

df = pd.DataFrame([['Address Data','City data','State Data','Zip Data',
                    np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan],
                   [np.nan,np.nan,np.nan,np.nan,
                    'Address Data','City data','State Data','Zip Data',
                    np.nan,np.nan,np.nan,np.nan],
                   [np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,
                    'Address Data','City data','State Data','Zip Data']],
                  columns=['Address1','City1','State1','Zip1',
                           'Address2','City2','State2','Zip2',
                           'Address3','City3','State3','Zip3'])
print(df)
       Address1      City1      State1      Zip1      Address2      City2  \
0  Address Data  City data  State Data  Zip Data           NaN        NaN
1           NaN        NaN         NaN       NaN  Address Data  City data
2           NaN        NaN         NaN       NaN           NaN        NaN

       State2      Zip2      Address3      City3      State3      Zip3
0         NaN       NaN           NaN        NaN         NaN       NaN
1  State Data  Zip Data           NaN        NaN         NaN       NaN
2         NaN       NaN  Address Data  City data  State Data  Zip Data
I'm trying to combine these into something like this:
#run code here to combine the multiple versions of the columns
print(df)

        Address       City       State       Zip
0  Address Data  City data  State Data  Zip Data
1  Address Data  City data  State Data  Zip Data
2  Address Data  City data  State Data  Zip Data
I was hoping I could use lines like this to set multiple columns at once:
df.loc[df['State1'].notnull(), ['Address','City','State','Zip']] = df.loc[df['State1'].notnull(), ['Address1','City1','State1','Zip1']].values
However, .loc doesn't allow you to create multiple columns at once. Any thoughts on how to combine the data would be much appreciated!
Edit based on solution below
The problem with that solution is my data may actually look more like this:
df = pd.DataFrame([['Address Data','City data','State Data','Zip Data',
                    'Address Data','City data','State Data','Zip Data',
                    np.nan,np.nan,np.nan,np.nan],
                   [np.nan,np.nan,np.nan,np.nan,
                    'Address Data','City data','State Data','Zip Data',
                    np.nan,np.nan,np.nan,np.nan],
                   [np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,
                    'Address Data','City data','State Data','Zip Data']],
                  columns=['Address1','City1','State1','Zip1',
                           'Address2','City2','State2','Zip2',
                           'Address3','City3','State3','Zip3'])
In this case, the solution below gives me 4 rows, when I only want the original 3. What I want to say is "if section 1 has data, use section 1; else if section 2 has data, use section 2; else if section 3 has data, use section 3". Each row needs to remain unique, because there are many other attributes of a row that I am not consolidating. Thanks!
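For reference, here is a rough sketch of that rule, building on the .loc attempt above (untested; the target columns are pre-created so .loc can assign into them):
#pre-create the target columns, then fill them section by section;
#earlier sections win because later passes only touch still-empty rows
for col in ['Address', 'City', 'State', 'Zip']:
    df[col] = None

for i in (1, 2, 3):
    src = ['Address%d' % i, 'City%d' % i, 'State%d' % i, 'Zip%d' % i]
    mask = df['Address'].isnull() & df['Address%d' % i].notnull()
    df.loc[mask, ['Address', 'City', 'State', 'Zip']] = df.loc[mask, src].values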
Upvotes: 2
Views: 1853
Reputation: 210982
What about this?
uniq_cols = df.columns.str.replace(r'\d+$', '', regex=True).unique()

new = pd.DataFrame(columns=uniq_cols)
for col in uniq_cols:
    # take the first non-null value per row among the columns sharing this name
    new[col] = df.filter(like=col).apply(lambda x: x.dropna().iloc[0], axis=1)
Output:
In [282]: new
Out[282]:
Address City State Zip
0 Address Data City data State Data Zip Data
1 Address Data City data State Data Zip Data
2 Address Data City data State Data Zip Data
Explanation:
In [283]: uniq_cols
Out[283]: array(['Address', 'City', 'State', 'Zip'], dtype=object)
In [284]: df.filter(like='Address').apply(lambda x: x.dropna().iloc[0], axis=1)
Out[284]:
0 Address Data
1 Address Data
2 Address Data
dtype: object
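The same idea can also be written as one expression (just a sketch; set_axis and str.replace(..., regex=True) assume a reasonably recent pandas):
#rename the columns to their base names, then take the first
#non-null value per row within each group of same-named columns
new = (df.set_axis(df.columns.str.replace(r'\d+$', '', regex=True), axis=1)
         .T.groupby(level=0).first().T)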
Upvotes: 0
Reputation: 863801
I think this solution is totally different, so I decided to create a new answer:
import pandas as pd
import numpy as np
#random dataframe
np.random.seed(1)
df1 = pd.DataFrame(np.random.randint(10, size=(3,9)))
df1.columns = ['a1','b1','c1','a2','b2','c2','a3','b3','c3']
df1.loc[[1,2],['a1','b1','c1']] = np.nan
print (df1)
a1 b1 c1 a2 b2 c2 a3 b3 c3
0 5.0 8.0 9.0 5 0 0 1 7 6
1 NaN NaN NaN 5 2 4 2 4 7
2 NaN NaN NaN 7 0 6 9 9 7
#stack the dataframe and split column e into a name (g) and a section number (h)
df = df1.stack().reset_index()
df.columns = ['d','e','f']
df[['g','h']] = df.e.str.extract(r'([a-zA-Z]+)([0-9]+)', expand=True)
#keep only cells whose section number equals the row index + 1 (suffixes start at 1)
df = df[df.d + 1 == df.h.astype(int)]
#remove columns h, e
df = df.drop(['h', 'e'], axis=1)
#reshaping
df = df.pivot(index='d', columns='g', values='f')
#remove index and columns names (pandas 0.18.0+)
df = df.rename_axis(None).rename_axis(None, axis=1)
print (df)
a b c
0 5.0 8.0 9.0
1 5.0 2.0 4.0
2 9.0 9.0 7.0
EDIT: I tried modifying your sample a bit:
import pandas as pd
import numpy as np
df1 = pd.DataFrame([['Address Data1','City data1','State Data1','Zip Data1',
                     'Address Data2','City data2','State Data2','Zip Data2',
                     np.nan,np.nan,np.nan,np.nan],
                    [np.nan,np.nan,np.nan,np.nan,
                     'Address Data3','City data3','State Data3','Zip Data3',
                     np.nan,np.nan,np.nan,np.nan],
                    [np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,
                     'Address Data4','City data4','State Data4','Zip Data4']],
                   columns=['Address1','City1','State1','Zip1',
                            'Address2','City2','State2','Zip2',
                            'Address3','City3','State3','Zip3'])
print (df1)
Address1 City1 State1 Zip1 Address2 \
0 Address Data1 City data1 State Data1 Zip Data1 Address Data2
1 NaN NaN NaN NaN Address Data3
2 NaN NaN NaN NaN NaN
City2 State2 Zip2 Address3 City3 State3 \
0 City data2 State Data2 Zip Data2 NaN NaN NaN
1 City data3 State Data3 Zip Data3 NaN NaN NaN
2 NaN NaN NaN Address Data4 City data4 State Data4
Zip3
0 NaN
1 NaN
2 Zip Data4
#stack the dataframe and split column e into a name (g) and a section number (h)
df = df1.stack().reset_index()
df.columns = ['d','e','f']
df[['g','h']] = df.e.str.extract(r'([a-zA-Z]+)([0-9]+)', expand=True)
#keep only cells whose section number equals the row index + 1 (suffixes start at 1)
df = df[df.d + 1 == df.h.astype(int)]
#remove columns h, e
df = df.drop(['h', 'e'], axis=1)
#reshaping
df = df.pivot(index='d', columns='g', values='f')
df = df.rename_axis(None).rename_axis(None, axis=1)
print (df)
Address City State Zip
0 Address Data1 City data1 State Data1 Zip Data1
1 Address Data3 City data3 State Data3 Zip Data3
2 Address Data4 City data4 State Data4 Zip Data4
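Note: the filter df.d + 1 == df.h.astype(int) works here only because row i's data happens to sit in section i + 1. If that is not guaranteed, a sketch of a more general filter (replacing that line, before the pivot) keeps each row's lowest-numbered populated section:
#for each original row d, keep only the cells from its lowest-numbered section
df['h'] = df.h.astype(int)
df = df[df.h == df.groupby('d').h.transform('min')]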
Upvotes: 1
Reputation: 863801
IIUC you can first select the column names with list comprehensions and then use pd.lreshape:
a = [col for col in df.columns if col.startswith('Address')]
c = [col for col in df.columns if col.startswith('City')]
s = [col for col in df.columns if col.startswith('State')]
z = [col for col in df.columns if col.startswith('Zip')]
print (a)
print (c)
print (s)
print (z)
['Address1', 'Address2', 'Address3']
['City1', 'City2', 'City3']
['State1', 'State2', 'State3']
['Zip1', 'Zip2', 'Zip3']
df1 = pd.lreshape(df, {'Address':a, 'City' :c, 'State':s, 'Zip' :z})
print (df1)
Address State City Zip
0 Address Data State Data City data Zip Data
1 Address Data State Data City data Zip Data
2 Address Data State Data City data Zip Data
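The four lists can also be built in one pass with a dict comprehension (a sketch):
groups = {name: [col for col in df.columns if col.startswith(name)]
          for name in ('Address', 'City', 'State', 'Zip')}
df1 = pd.lreshape(df, groups)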
EDIT: If you need to keep the rows with NaN values, add the parameter dropna=False:
df1 = pd.lreshape(df, {'Address':a, 'City' :c, 'State':s, 'Zip' :z}, dropna=False)
print (df1)
Address State City Zip
0 Address Data State Data City data Zip Data
1 NaN NaN NaN NaN
2 NaN NaN NaN NaN
3 NaN NaN NaN NaN
4 Address Data State Data City data Zip Data
5 NaN NaN NaN NaN
6 NaN NaN NaN NaN
7 NaN NaN NaN NaN
8 Address Data State Data City data Zip Data
Another numpy solution, but it requires the data to be nicely ordered (it is in this sample, but maybe not in real data):
print (pd.DataFrame(df.values.reshape(9, 4), columns=['Address','City','State','Zip'])
         .dropna())
Address City State Zip
0 Address Data City data State Data Zip Data
4 Address Data City data State Data Zip Data
8 Address Data City data State Data Zip Data
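If you don't want to hardcode the 9, the row count can be inferred (a sketch; it still assumes every section holds exactly these four columns, in this order):
print (pd.DataFrame(df.values.reshape(-1, 4),
                    columns=['Address','City','State','Zip'])
         .dropna()
         .reset_index(drop=True))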
Upvotes: 1