Reputation: 71600
Let's say I have a DataFrame:
>>> import pandas as pd
>>> df = pd.DataFrame({'a1': [1, 2], 'a2': [3, 4], 'b1': [5, 6], 'b2': [7, 8], 'c': [9, 0]})
>>> df
   a1  a2  b1  b2  c
0   1   3   5   7  9
1   2   4   6   8  0
And I want to merge (maybe not merge, but concatenate) the columns whose names share the same first letter, such as a1 and a2, and so on. But as we can see, the c column sits by itself without any similar ones; rather than having it throw an error, I want it padded with NaNs. In other words, I want to reshape the wide DataFrame into a long one, basically a wide-to-long modification.
I already have a solution to the problem; the only issue is that it's very inefficient, and I would like a more efficient and faster one (unlike mine :P). I currently have a for loop and a try/except (ugh, sounds bad already), like so:
>>> import numpy as np
>>> df2 = pd.DataFrame()
>>> for i in df.columns.str[:1].unique():
...     try:
...         df2[i] = df[[x for x in df.columns if x[:1] == i]].values.flatten()
...     except ValueError:
...         # the lone group is shorter than df2, so pad it with NaNs
...         l = df[[x for x in df.columns if x[:1] == i]].values.flatten().tolist()
...         df2[i] = l + [np.nan] * (len(df2) - len(l))
>>> df2
   a  b    c
0  1  5  9.0
1  3  7  0.0
2  2  6  NaN
3  4  8  NaN
I would like to obtain the same results with better code.
Upvotes: 9
Views: 2396
Reputation: 402813
I'd recommend melt, followed by pivot. To resolve duplicates, you'll need to pivot on a cumcounted column.
u = df.melt()
u['variable'] = u['variable'].str[0] # extract the first letter
u.assign(count=u.groupby('variable').cumcount()).pivot('count', 'variable', 'value')
variable    a    b    c
count
0         1.0  5.0  9.0
1         2.0  6.0  0.0
2         3.0  7.0  NaN
3         4.0  8.0  NaN
This can be rewritten as follows; u.pivot(*u) works because iterating over a DataFrame yields its column names (here 'count', 'variable', 'value'), which are passed positionally as the index, columns and values arguments:
u = df.melt()
u['variable'] = [x[0] for x in u['variable']]
u.insert(0, 'count', u.groupby('variable').cumcount())
u.pivot(*u)
variable    a    b    c
count
0         1.0  5.0  9.0
1         2.0  6.0  0.0
2         3.0  7.0  NaN
3         4.0  8.0  NaN
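As an aside, on recent pandas (2.0+) pivot's arguments are keyword-only, so neither positional form above runs there. A minimal sketch of the same pipeline, assuming pandas 2.0 or later:

import pandas as pd

df = pd.DataFrame({'a1': [1, 2], 'a2': [3, 4],
                   'b1': [5, 6], 'b2': [7, 8], 'c': [9, 0]})

u = df.melt()
u['variable'] = u['variable'].str[0]  # extract the first letter
out = (u.assign(count=u.groupby('variable').cumcount())
        .pivot(index='count', columns='variable', values='value'))
print(out)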
If performance matters, here's an alternative with pd.concat:
from operator import itemgetter

pd.concat({
    k: pd.Series(g.values.ravel())
    for k, g in df.groupby(itemgetter(0), axis=1)
}, axis=1)
   a  b    c
0  1  5  9.0
1  3  7  0.0
2  2  6  NaN
3  4  8  NaN
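Note that groupby(..., axis=1) has since been deprecated (pandas 2.1). A sketch of the same idea that groups the transposed frame instead, assuming a recent pandas:

import pandas as pd

df = pd.DataFrame({'a1': [1, 2], 'a2': [3, 4],
                   'b1': [5, 6], 'b2': [7, 8], 'c': [9, 0]})

out = pd.concat({
    # column-major ('F') ravel on the transposed group reproduces
    # the row order of the axis=1 version above
    k: pd.Series(g.to_numpy().ravel(order='F'))
    for k, g in df.T.groupby(df.columns.str[0])
}, axis=1)
print(out)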
Upvotes: 5
Reputation: 863166
Use a dictionary comprehension:
df = pd.DataFrame({i: pd.Series(x.to_numpy().ravel())
                   for i, x in df.groupby(lambda x: x[0], axis=1)})
print(df)
   a  b    c
0  1  5  9.0
1  3  7  0.0
2  2  6  NaN
3  4  8  NaN
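A callable passed to groupby is applied to each column label, which is how lambda x: x[0] buckets the columns by their first letter; a quick illustration on the original columns of the sample frame:

print([(c, c[0]) for c in df.columns])
# [('a1', 'a'), ('a2', 'a'), ('b1', 'b'), ('b2', 'b'), ('c', 'c')]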
Upvotes: 3
Reputation: 4273
This solution gives a similar answer to cs95's (the rows come out in a different order, since same-suffix blocks are stacked whole) and is two to three times faster.
grouping = df.columns.map(lambda s: int(s[1:]) if len(s) > 1 else 1)
df.columns = df.columns.str[0] # Make a copy if the original dataframe needs to be retained
result = pd.concat((g for _, g in df.groupby(grouping, axis=1)),
                   axis=0, ignore_index=True, sort=False)
Output:

   a  b    c
0  1  5  9.0
1  2  6  0.0
2  3  7  NaN
3  4  8  NaN
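For reference, the grouping key maps every column to its numeric suffix (suffix-less columns fall back to 1), so whole same-suffix blocks get stacked; evaluated on the original columns, before the in-place rename, it looks like this:

grouping = df.columns.map(lambda s: int(s[1:]) if len(s) > 1 else 1)
print(list(grouping))
# [1, 2, 1, 2, 1] -> block 1 is [a1, b1, c], block 2 is [a2, b2]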
Upvotes: 1
Reputation: 29742
Using pd.concat with pd.melt and pd.groupby:

pd.concat([d.T.melt(value_name=k)[k] for k, d in df.groupby(df.columns.str[0], axis=1)], axis=1)
Output:

   a  b    c
0  1  5  9.0
1  3  7  0.0
2  2  6  NaN
3  4  8  NaN
Upvotes: 1
Reputation: 13255
Using rename and groupby.apply:

import numpy as np

df = (df.rename(columns=dict(zip(df.columns, df.columns.str[:1])))
        .groupby(level=0, axis=1, group_keys=False)
        # x.values.flat flattens row-major, e.g. [1, 3, 2, 4] for the 'a' group
        .apply(lambda x: pd.DataFrame(x.values.flat, columns=np.unique(x.columns))))
print(df)
   a  b    c
0  1  5  9.0
1  3  7  0.0
2  2  6  NaN
3  4  8  NaN
Upvotes: 1
Reputation: 323316
I know this is not as good as using melt, but it squeezes into one line; if you need a faster solution, try cs95's answer.
df.groupby(df.columns.str[0], axis=1).agg(lambda x: x.tolist()).sum().apply(pd.Series).T
Out[391]:
     a    b    c
0  1.0  5.0  9.0
1  3.0  7.0  0.0
2  2.0  6.0  NaN
3  4.0  8.0  NaN
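To unpack the chain, a sketch of the intermediate steps (this assumes a pandas version where groupby(axis=1) is still available, as in the answer):

# each cell becomes the list of that row's values within the group
step1 = df.groupby(df.columns.str[0], axis=1).agg(lambda x: x.tolist())
#         a       b    c
# 0  [1, 3]  [5, 7]  [9]
# 1  [2, 4]  [6, 8]  [0]

# summing down each column concatenates the per-row lists
step2 = step1.sum()  # a -> [1, 3, 2, 4], b -> [5, 7, 6, 8], c -> [9, 0]

# expand each list into columns (shorter lists pad with NaN), then transpose
step3 = step2.apply(pd.Series).T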
Upvotes: 2
Reputation: 150785
We can try groupby columns (axis=1
):
def f(g,a):
ret = g.stack().reset_index(drop=True)
ret.name = a
return ret
pd.concat( (f(g,a) for a,g in df.groupby(df.columns.str[0], axis=1)), axis=1)
Output:

   a  b    c
0  1  5  9.0
1  3  7  0.0
2  2  6  NaN
3  4  8  NaN
Upvotes: 3