FooBar

Reputation: 16528

Pandas efficiently repeat rows

I know that replicating rows is typically bad for performance, which is why most answers on Stack Overflow suggest better alternatives instead of explaining how to actually do it. For my use case, though, I really do need to replicate the rows.

I have a table with replication weights,

   id   some_value weight
    1            2      5
    2            A      2
    3            B      1
    4            3      3

where each row needs to be repeated as many times as its weight value. The real data frame is huge, so what would be an efficient way to achieve this?

Expected output:

   id   some_value weight
    1            2      5
    1            2      5
    1            2      5
    1            2      5
    1            2      5
    2            A      2
    2            A      2
    3            B      1
    4            3      3
    4            3      3
    4            3      3

Upvotes: 1

Views: 1094

Answers (3)

Panwen Wang

Reputation: 3835

This is similar to uncount in tidyr:

https://tidyr.tidyverse.org/reference/uncount.html

I wrote a package (https://github.com/pwwang/datar) that implements this API:

from datar import f
from datar.tibble import tibble
from datar.tidyr import uncount

df = tibble(
  id=range(1,5),
  some_value=[2,'A','B',3],
  weight=[5,2,1,3]
)
df >> uncount(f.weight, _remove=False)

Output:

   id some_value  weight
0   1          2       5
0   1          2       5
0   1          2       5
0   1          2       5
0   1          2       5
1   2          A       2
1   2          A       2
2   3          B       1
3   4          3       3
3   4          3       3
3   4          3       3

Upvotes: 0

Zero

Reputation: 77027

Here are two ways:

1) Using set_index and repeat

In [1070]: df.set_index(['id', 'some_value'])['weight'].repeat(df['weight']).reset_index()
Out[1070]:
    id some_value  weight
0    1          2       5
1    1          2       5
2    1          2       5
3    1          2       5
4    1          2       5
5    2          A       2
6    2          A       2
7    3          B       1
8    4          3       3
9    4          3       3
10   4          3       3

2) Using .loc and .repeat

In [1071]: df.loc[df.index.repeat(df.weight)].reset_index(drop=True)
Out[1071]:
    id some_value  weight
0    1          2       5
1    1          2       5
2    1          2       5
3    1          2       5
4    1          2       5
5    2          A       2
6    2          A       2
7    3          B       1
8    4          3       3
9    4          3       3
10   4          3       3

Details

In [1072]: df
Out[1072]:
   id some_value  weight
0   1          2       5
1   2          A       2
2   3          B       1
3   4          3       3
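
The snippets above assume df already exists. As a minimal self-contained sketch of the second approach, the data frame below is rebuilt from the table in the question; the variable name repeated is only illustrative. Note that selecting with .loc by repeated labels assumes the index of df is unique.

```python
import pandas as pd

# Rebuild the example table from the question.
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "some_value": [2, "A", "B", 3],
    "weight": [5, 2, 1, 3],
})

# Repeat each index label `weight` times, select those rows,
# then renumber the rows from 0.
repeated = df.loc[df.index.repeat(df["weight"])].reset_index(drop=True)

print(len(repeated))  # 11, the sum of the weights
```

Pass drop=True to reset_index so the old, repeated index labels are discarded instead of being added as a column.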

Upvotes: 2

user308827

Reputation: 22031

Perhaps treat it like a weighted array:

def weighted_array(arr, weights):
    # Pair each element with its weight and append it that many times.
    weighted_arr = []
    for value, weight in zip(arr, weights):
        for _ in range(weight):
            weighted_arr.append(value)
    return weighted_arr

The returned weighted_arr contains each element of arr, repeated as many times as its corresponding entry in weights.
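
For large inputs the nested Python loops will probably be slow. A vectorized variant of the same idea can be sketched with numpy.repeat; the name weighted_array_np is hypothetical, and dtype=object is an assumption made here so that mixed values such as 2 and 'A' stay unchanged.

```python
import numpy as np

def weighted_array_np(arr, weights):
    # Hypothetical helper: repeat each element of arr by the matching weight.
    # dtype=object keeps mixed int/str values as they are.
    values = np.asarray(arr, dtype=object)
    return np.repeat(values, weights)

result = weighted_array_np([2, "A", "B", 3], [5, 2, 1, 3])
print(result.tolist())  # [2, 2, 2, 2, 2, 'A', 'A', 'B', 3, 3, 3]
```

This handles a single column; for a whole data frame, repeating the index or row positions and selecting the rows once avoids repeating every column separately.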

Upvotes: 0
