ascripter

Reputation: 6213

Explode / unpivot DataFrame containing count data to one row per item

Context: Data transformation for a logistic regression problem. I have the following data structure:

df = pd.DataFrame({"group": ["A", "B"], "total": [3, 5], "occurrence": [2, 1]})

I want to do something like pd.explode, but creating one row per item of total, i.e. 3+5 = 8 rows, where occurrence of the rows hold 1 and the rest 0 (either in the occurrence column or in a new target column).

Currently I'm doing it iteratively, which is quite slow on large data:

expanded = []
for ix, row in df.iterrows():
    for i in range(row["total"]):
        row["y"] = 1 if i < row["occurrence"] else 0
        expanded.append(row.copy())
df_out = pd.DataFrame(expanded).reset_index(drop=True)
df_out.drop(["total", "occurrence"], axis=1, inplace=True)
df_out


  group  y
0     A  1
1     A  1
2     A  0
3     B  1
4     B  0
5     B  0
6     B  0
7     B  0

Upvotes: 1

Views: 96

Answers (4)

Soudipta Dutta

Reputation: 2122

import pandas as pd
import numpy as np


df = pd.DataFrame({"group": ["A", "B"], "total": [3, 5], "occurrence": [2, 1]})
print(df)
'''
  group  total  occurrence
0     A      3           2
1     B      5           1
'''
# repeat each row 'total' times
df_expanded = df.loc[df.index.repeat(df['total'])].reset_index(drop=True)

# the occurrence value for each expanded row
occurrence_repeated = np.repeat(df['occurrence'].values, df['total'].values)

# position of each expanded row within its group
position = df_expanded.groupby('group').cumcount().to_numpy()

df_expanded['y'] = (position < occurrence_repeated).astype(int)
print(df_expanded)
'''
  group  total  occurrence  y
0     A      3           2  1
1     A      3           2  1
2     A      3           2  0
3     B      5           1  1
4     B      5           1  0
5     B      5           1  0
6     B      5           1  0
7     B      5           1  0
'''
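To match the two-column output requested in the question, the helper columns can be dropped at the end (a small addition to the snippet above, not part of the original answer; note that cumcount over 'group' assumes each group value appears in only one input row):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"group": ["A", "B"], "total": [3, 5], "occurrence": [2, 1]})

# repeat rows, compute per-group position, compare against occurrence
df_expanded = df.loc[df.index.repeat(df['total'])].reset_index(drop=True)
occurrence_repeated = np.repeat(df['occurrence'].values, df['total'].values)
position = df_expanded.groupby('group').cumcount().to_numpy()
df_expanded['y'] = (position < occurrence_repeated).astype(int)

# drop the helper columns to keep only group / y
df_out = df_expanded.drop(columns=['total', 'occurrence'])
```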

Upvotes: 0

PaulS

Reputation: 25298

I offer four solutions here; the fastest is Solution 4.


SOLUTION 1

Another possible solution, which proceeds according to the following steps:

  • It first repeats each row of the original dataframe according to the total values.

  • Then, it creates the y values in a single step with a list comprehension and np.concatenate: for each row, it builds an array of occurrence ones followed by (total - occurrence) zeros using np.r_.

(df.loc[df.index.repeat(df['total']), ['group']]
    .assign(y=np.concatenate(
        [np.r_[np.ones(o, dtype=int), np.zeros(t - o, dtype=int)]
         for t, o in zip(df['total'], df['occurrence'])]))
    .reset_index(drop=True))

SOLUTION 2

As @mozway has correctly pointed out in a comment below, the list comprehension hurts performance. I am therefore proposing essentially the same approach, but compiled with numba, which may boost performance.


from numba import njit

out = df.loc[df.index.repeat(df['total']), ['group']].reset_index(drop=True)

@njit
def gen_y(total, occurrence):
    y_vals = np.empty(total.sum(), dtype=np.int32)
    idx = 0
    for t, o in zip(total, occurrence):
        y_vals[idx:idx + o] = 1
        y_vals[idx + o:idx + t] = 0
        idx += t
    return y_vals

out['y'] = gen_y(df['total'].values, df['occurrence'].values)

SOLUTION 3

@mozway suspects that the bottleneck is pandas' repeat. If that is right, the following solution should perform better:

@njit
def gen_y(totals, occurrences):
    total_rows = totals.sum()
    new_y = np.empty(total_rows, dtype=np.int32)
    idx = 0
    for t, o in zip(totals, occurrences):
        new_y[idx:idx + o] = 1
        new_y[idx + o:idx + t] = 0
        idx += t
    return new_y

y_vals = gen_y(df['total'].values, df['occurrence'].values)
groups = np.repeat(df['group'].values, df['total'].values)

out = pd.DataFrame({
    "group": groups,
    "y": y_vals
})

SOLUTION 4

The following solution is the fastest I have achieved:

@njit
def gen_y(total, occurrence):
    y_vals = np.zeros(total.sum(), dtype=np.int32)
    idx = 0
    for t, o in zip(total, occurrence):
        y_vals[idx:idx + o] = 1
        idx += t
    return y_vals

pd.DataFrame({
    "group": np.repeat(df['group'].values, df['total'].values),
    "y": gen_y(df['total'].values, df['occurrence'].values)
})
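For reference, the same index arithmetic can be written in pure NumPy without the numba dependency (my sketch, not part of the original answer): compute each expanded row's position within its group and compare it against the repeated occurrence.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"group": ["A", "B"], "total": [3, 5], "occurrence": [2, 1]})

totals = df['total'].to_numpy()
occ = df['occurrence'].to_numpy()

# offset of each group's first expanded row
starts = np.cumsum(totals) - totals
# position of each expanded row within its own group
pos = np.arange(totals.sum()) - np.repeat(starts, totals)
y = (pos < np.repeat(occ, totals)).astype(int)

out = pd.DataFrame({"group": np.repeat(df['group'].to_numpy(), totals),
                    "y": y})
```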

Performance (df with 20k rows):

# This solution 4
1.05 ms ± 92.2 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

# Best @mozway's solution
1.42 ms ± 29 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

Output:

  group  y
0     A  1
1     A  1
2     A  0
3     B  1
4     B  0
5     B  0
6     B  0
7     B  0

Upvotes: 5

yellow_dot

Reputation: 176

I was thinking about a more numpy-based solution. Keeping your df as the input:

groups = {}
for group in df['group'].unique():
    row = df.loc[df['group'] == group].iloc[0]
    y = np.zeros(row['total'])
    y[:row['occurrence']] = 1            # first 'occurrence' entries are 1
    idx = np.repeat(group, y.shape[0])
    groups[group] = np.array([idx, y]).T # stacking with strings casts y to str
df2 = pd.DataFrame(np.concatenate([groups[group] for group in groups]),
                   columns=['group', 'y'])

you reach df2 as desired. Note that stacking the group labels and numbers into one array leaves both columns as strings, so y may need a cast back to a numeric type.

Upvotes: 0

mozway

Reputation: 260215

You could repeat the rows, then assign a new column based on the output from groupby.cumcount:

out = (df.loc[df.index.repeat(df['total']), ['group', 'occurrence']]
         .assign(y=lambda x: x.groupby(level=0).cumcount()
                              .lt(x.pop('occurrence'))
                              .astype(int))
      )

Or using numpy's tile/repeat:

import numpy as np

out = pd.DataFrame({'group': np.repeat(df['group'], df['total']),
                    'y': np.repeat(np.tile([1, 0], len(df)),
                                   np.r_[df['occurrence'],
                                         df['total']-df['occurrence']])
                    })

Output:

  group  y
0     A  1
0     A  1
0     A  0
1     B  1
1     B  0
1     B  0
1     B  0
1     B  0

If you want to use explode:

out = (df.assign(y=lambda x: [[1 if i < o else 0 for i in range(t)]
                              for t, o in zip(x['total'], x['occurrence'])])
         .explode('y')
      )

Output:

  group  total  occurrence  y
0     A      3           2  1
0     A      3           2  1
0     A      3           2  0
1     B      5           1  1
1     B      5           1  0
1     B      5           1  0
1     B      5           1  0
1     B      5           1  0
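One caveat worth noting (my addition, not from the answer): explode on lists of Python ints leaves y with object dtype, so a final cast may be wanted:

```python
import pandas as pd

df = pd.DataFrame({"group": ["A", "B"], "total": [3, 5], "occurrence": [2, 1]})

out = (df.assign(y=lambda x: [[1 if i < o else 0 for i in range(t)]
                              for t, o in zip(x['total'], x['occurrence'])])
         .explode('y')
         .astype({'y': int}))  # explode yields object dtype; cast back to int
```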

timings

On 20k rows. The pure numpy approach is the fastest.

# original approach
9.96 s ± 248 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# repetition
7.17 ms ± 242 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# numpy tile+repeat
1.31 ms ± 23.7 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

# explode
18.2 ms ± 959 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

# numpy + list comprehension (PaulS)
125 ms ± 2.29 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

# numba (PaulS)
3.88 ms ± 144 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Upvotes: 3
