Reputation: 23
I have a dataframe that is 762106 rows x 79 columns. It contains 14 'sets' of three columns; within each set, each column indicates a different level of 'intensity' for a given feature and holds NaN wherever the value sits in one of the other two columns. The values are already encoded, and I want to condense each set into a single column, so that instead of 42 of these columns I have 14.
A subset can be recreated like this:
import pandas as pd
import numpy as np
df = pd.DataFrame([[np.nan, 2, np.nan, 1, np.nan, np.nan, np.nan, np.nan, 3],
                   [1, np.nan, np.nan, np.nan, 2, np.nan, 1, np.nan, np.nan],
                   [np.nan, np.nan, 3, 1, np.nan, np.nan, np.nan, 2, np.nan]],
                  columns=['a','aa','aaa','b','bb','bbb','c','cc','ccc'])
Output:
     a   aa  aaa    b   bb  bbb    c   cc  ccc
0  NaN  2.0  NaN  1.0  NaN  NaN  NaN  NaN  3.0
1  1.0  NaN  NaN  NaN  2.0  NaN  1.0  NaN  NaN
2  NaN  NaN  3.0  1.0  NaN  NaN  NaN  2.0  NaN
I want them to look like this:
   a  b  c
0  2  1  3
1  1  2  1
2  3  1  2
My current solution is to pull values from aa, aaa, etc. with .fillna() and then drop the superfluous columns with .drop():
df['a'] = df['a'].fillna(df['aa']).fillna(df['aaa'])
df = df.drop(['aa','aaa'], axis=1)
df['b'] = df['b'].fillna(df['bb']).fillna(df['bbb'])
df = df.drop(['bb','bbb'], axis=1)
This works, but I want to know if there is a more elegant way to accomplish it without copy-pasting this code block 14 times.
Upvotes: 2
Views: 514
Reputation: 4761
You can use pandas.DataFrame.groupby with axis=1 ("columns"):
df.groupby(lambda x: x[0], axis=1).sum()
     a    b    c
0  2.0  1.0  3.0
1  1.0  2.0  1.0
2  3.0  1.0  2.0
When groupby is given a function, that function is called on each value of the object's index, in this case the column names.
Since you can group by any function, this is a really flexible solution.
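As a quick illustration of how the key function sees the labels, plus a transpose-based sketch for newer pandas versions (groupby(axis=1) is deprecated from pandas 2.1):
# The key function receives each column label; its return value is the group key:
[(col, col[0]) for col in df.columns]
# [('a', 'a'), ('aa', 'a'), ('aaa', 'a'), ('b', 'b'), ('bb', 'b'),
#  ('bbb', 'b'), ('c', 'c'), ('cc', 'c'), ('ccc', 'c')]

# Sketch of an equivalent without axis=1: transpose, group the rows
# (the old column labels), then transpose back:
df.T.groupby(lambda x: x[0]).sum().T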
Upvotes: 1
Reputation: 862671
You can group by the first letter of the column names with GroupBy.first:
df = df.groupby(df.columns.str[0], axis=1).first()
print(df)
     a    b    c
0  2.0  1.0  3.0
1  1.0  2.0  1.0
2  3.0  1.0  2.0
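As an illustration, df.columns.str[0] builds the group keys from the first character of each column name, and GroupBy.first then takes the first non-NaN value in each group:
# Group keys on the original frame (before the reassignment above):
df.columns.str[0].tolist()
# ['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c']
Because each row holds exactly one non-NaN value per set, first() simply picks that value instead of summing, so it also works for non-numeric codes.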
Upvotes: 0