Reputation: 6213
Context: Data transformation for a logistic regression problem. I have the following data structure:
df = pd.DataFrame({"group": ["A", "B"], "total": [3, 5], "occurrence": [2, 1]})
I want to do something like pd.explode, but creating one row per item of total, i.e. 3 + 5 = 8 rows, where occurrence of the rows hold 1 and the rest hold 0 (either in the occurrence column or a new target column).
Currently I'm doing it iteratively, which is quite slow on large data:
expanded = []
for ix, row in df.iterrows():
    for i in range(row["total"]):
        row["y"] = 1 if i < row["occurrence"] else 0
        expanded.append(row.copy())
df_out = pd.DataFrame(expanded).reset_index(drop=True)
df_out.drop(["total", "occurrence"], axis=1, inplace=True)
df_out
group y
0 A 1
1 A 1
2 A 0
3 B 1
4 B 0
5 B 0
6 B 0
7 B 0
Upvotes: 1
Views: 96
Reputation: 2122
import pandas as pd
import numpy as np
df = pd.DataFrame({"group": ["A", "B"], "total": [3, 5], "occurrence": [2, 1]})
print(df)
'''
group total occurrence
0 A 3 2
1 B 5 1
'''
df_expanded = df.loc[df.index.repeat(df['total'])].reset_index(drop=True)
occurrence_repeated = np.repeat(df['occurrence'].values, df['total'].values)
position_in_group = df_expanded.groupby('group').cumcount().to_numpy()
df_expanded['y'] = (position_in_group < occurrence_repeated).astype(int)
print(df_expanded)
'''
group total occurrence y
0 A 3 2 1
1 A 3 2 1
2 A 3 2 0
3 B 5 1 1
4 B 5 1 0
5 B 5 1 0
6 B 5 1 0
7 B 5 1 0
'''
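One caveat worth noting (my addition, not part of the answer above): groupby('group') assumes each group label occurs in only one input row. Grouping on the repeated original index instead makes the same trick label-agnostic:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"group": ["A", "B"], "total": [3, 5], "occurrence": [2, 1]})

# Keep the original (repeated) index and count positions within it,
# so rows sharing a group label are still handled independently.
df_expanded = df.loc[df.index.repeat(df["total"])].copy()
position = df_expanded.groupby(level=0).cumcount().to_numpy()
occurrence_repeated = np.repeat(df["occurrence"].values, df["total"].values)
df_expanded["y"] = (position < occurrence_repeated).astype(int)
df_expanded = df_expanded.reset_index(drop=True)
```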
Upvotes: 0
Reputation: 25298
I offer four solutions here; Solution 4 is the fastest.
SOLUTION 1
Another possible solution, which proceeds according to the following steps:
It first repeats each row of the original dataframe according to the total values.
Then, it creates the y values in a single step with a list comprehension and np.concatenate, where for each row it constructs an array of occurrence ones followed by (total - occurrence) zeros using np.r_.
import numpy as np

(df.loc[df.index.repeat(df['total']), ['group']]
   .assign(y=np.concatenate(
       [np.r_[np.ones(o, dtype=int), np.zeros(t - o, dtype=int)]
        for t, o in zip(df['total'], df['occurrence'])]))
   .reset_index(drop=True))
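To see what each comprehension term produces, take the first row (total=3, occurrence=2):

```python
import numpy as np

# occurrence ones followed by (total - occurrence) zeros
term = np.r_[np.ones(2, dtype=int), np.zeros(3 - 2, dtype=int)]
```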
SOLUTION 2
As @mozway has correctly pointed out in a comment below, the list comprehension hurts performance. Thus, I am now proposing basically the same approach, but accelerated with numba.
from numba import njit

out = df.loc[df.index.repeat(df['total']), ['group']].reset_index(drop=True)

@njit
def gen_y(total, occurrence):
    y_vals = np.empty(total.sum(), dtype=np.int32)
    idx = 0
    for t, o in zip(total, occurrence):
        y_vals[idx:idx + o] = 1
        y_vals[idx + o:idx + t] = 0
        idx += t
    return y_vals

out['y'] = gen_y(df['total'].values, df['occurrence'].values)
SOLUTION 3
@mozway suspects that the bottleneck is in pandas' repeat. If he is right, the following solution should exhibit greater performance:
@njit
def gen_y(totals, occurrences):
    total_rows = totals.sum()
    new_y = np.empty(total_rows, dtype=np.int32)
    idx = 0
    for t, o in zip(totals, occurrences):
        new_y[idx:idx + o] = 1
        new_y[idx + o:idx + t] = 0
        idx += t
    return new_y

y_vals = gen_y(df['total'].values, df['occurrence'].values)
groups = np.repeat(df['group'].values, df['total'].values)
out = pd.DataFrame({
    "group": groups,
    "y": y_vals
})
SOLUTION 4
The following solution is the fastest I have achieved:
@njit
def gen_y(total, occurrence):
    # start from zeros, so only the leading ones need to be written per row
    y_vals = np.zeros(total.sum(), dtype=np.int32)
    idx = 0
    for t, o in zip(total, occurrence):
        y_vals[idx:idx + o] = 1
        idx += t
    return y_vals

pd.DataFrame({
    "group": np.repeat(df['group'].values, df['total'].values),
    "y": gen_y(df['total'].values, df['occurrence'].values)
})
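If numba is not available, the same fill logic can be expressed with pure NumPy offsets (my sketch, not part of the original answer): compute each expanded row's position within its source row, then compare against the repeated occurrence counts:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"group": ["A", "B"], "total": [3, 5], "occurrence": [2, 1]})

total = df["total"].to_numpy()
occurrence = df["occurrence"].to_numpy()

# Position of each expanded row within its source row: 0, 1, 2, 0, 1, ...
starts = np.repeat(np.cumsum(total) - total, total)
pos = np.arange(total.sum()) - starts
y = (pos < np.repeat(occurrence, total)).astype(np.int32)

out = pd.DataFrame({"group": np.repeat(df["group"].to_numpy(), total), "y": y})
```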
Performance (df with 20k rows):
# This solution 4
1.05 ms ± 92.2 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
# Best @mozway's solution
1.42 ms ± 29 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
Output:
group y
0 A 1
1 A 1
2 A 0
3 B 1
4 B 0
5 B 0
6 B 0
7 B 0
Upvotes: 5
Reputation: 176
I was thinking about a more numpy-oriented solution, keeping your df as the input:
groups = {}
for group in df['group'].unique():
    y = np.zeros(df.loc[df[df['group'] == group].index, 'total'].iloc[0])
    idx = np.repeat(group, y.shape[0])
    y[:df.loc[df[df['group'] == group].index, 'occurrence'].iloc[0]] = 1
    y = np.array([idx, y]).T
    groups[group] = y

df2 = pd.DataFrame(np.concatenate([groups[group] for group in groups.keys()]),
                   columns=['group', 'y'])
you reach df2 as desired.
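One pitfall with this construction (my note): np.array([idx, y]) mixes strings and floats, so NumPy upcasts both columns to a common string dtype and df2['y'] ends up holding values like '1.0'. A cast restores integers:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"group": ["A", "B"], "total": [3, 5], "occurrence": [2, 1]})

groups = {}
for group in df["group"].unique():
    row = df[df["group"] == group].iloc[0]
    y = np.zeros(row["total"])
    y[: row["occurrence"]] = 1
    idx = np.repeat(group, y.shape[0])
    groups[group] = np.array([idx, y]).T  # upcasts everything to strings

df2 = pd.DataFrame(np.concatenate(list(groups.values())), columns=["group", "y"])
df2["y"] = df2["y"].astype(float).astype(int)  # '1.0' -> 1.0 -> 1
```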
Upvotes: 0
Reputation: 260215
You could repeat the rows, then assign a new column based on the output from groupby.cumcount:
out = (df.loc[df.index.repeat(df['total']), ['group', 'occurrence']]
         .assign(y=lambda x: x.groupby(level=0).cumcount()
                              .lt(x.pop('occurrence'))
                              .astype(int))
       )
Or with pure numpy, interleaving the per-row counts of ones and zeros via np.c_ and ravel:

import numpy as np

out = pd.DataFrame({'group': np.repeat(df['group'], df['total']),
                    'y': np.repeat(np.tile([1, 0], len(df)),
                                   np.c_[df['occurrence'],
                                         df['total'] - df['occurrence']].ravel())
                    })
Output:
group y
0 A 1
0 A 1
0 A 0
1 B 1
1 B 0
1 B 0
1 B 0
1 B 0
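The ordering of the repeat counts matters here (my note): np.tile produces [1, 0, 1, 0, ...], one (ones, zeros) pair per source row, so the counts must interleave per row as well. np.c_[...].ravel() yields exactly that interleaving:

```python
import numpy as np

occurrence = np.array([2, 1])
total = np.array([3, 5])

# interleaved per row: ones_A, zeros_A, ones_B, zeros_B
counts = np.c_[occurrence, total - occurrence].ravel()
y = np.repeat(np.tile([1, 0], 2), counts)
```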
If you want to use explode
:
out = (df.assign(y=lambda x: [[1 if i < o else 0 for i in range(t)]
                              for t, o in zip(x['total'], x['occurrence'])])
         .explode('y')
       )
Output:
group total occurrence y
0 A 3 2 1
0 A 3 2 1
0 A 3 2 0
1 B 5 1 1
1 B 5 1 0
1 B 5 1 0
1 B 5 1 0
1 B 5 1 0
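A caveat with explode (my note): the exploded y column has object dtype, so cast it to int before feeding it to a model:

```python
import pandas as pd

df = pd.DataFrame({"group": ["A", "B"], "total": [3, 5], "occurrence": [2, 1]})

out = (df.assign(y=lambda x: [[1 if i < o else 0 for i in range(t)]
                              for t, o in zip(x["total"], x["occurrence"])])
         .explode("y"))

# explode keeps the list elements as Python objects; cast explicitly
out["y"] = out["y"].astype(int)
```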
Timings on 20k rows; the pure numpy approach is the fastest.
# original approach
9.96 s ± 248 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# repetition
7.17 ms ± 242 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# numpy tile+repeat
1.31 ms ± 23.7 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
# explode
18.2 ms ± 959 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# numpy + list comprehension (PaulS)
125 ms ± 2.29 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# numba (PaulS)
3.88 ms ± 144 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Upvotes: 3