Reputation: 16528
I know that replicating rows is typically horrible for performance, which is why most answers on Stack Overflow suggest better alternatives instead of explaining how to actually do it. For my use case, though, I really do need it.
I have a table with replication weights,
id some_value weight
1 2 5
2 A 2
3 B 1
4 3 3
where I need to repeat each row as many times as its weight value. Think of a huge data frame. What would be a very efficient way to achieve this?
Expected output:
id some_value weight
1 2 5
1 2 5
1 2 5
1 2 5
1 2 5
2 A 2
2 A 2
3 B 1
4 3 3
4 3 3
4 3 3
Upvotes: 1
Views: 1094
Reputation: 3835
It's something like uncount in tidyr: https://tidyr.tidyverse.org/reference/uncount.html
I wrote a package (https://github.com/pwwang/datar) that implements this API:
from datar import f
from datar.tibble import tibble
from datar.tidyr import uncount
df = tibble(
    id=range(1,5),
    some_value=[2,'A','B',3],
    weight=[5,2,1,3]
)
df >> uncount(f.weight, _remove=False)
Output:
id some_value weight
0 1 2 5
0 1 2 5
0 1 2 5
0 1 2 5
0 1 2 5
1 2 A 2
1 2 A 2
2 3 B 1
3 4 3 3
3 4 3 3
3 4 3 3
Upvotes: 0
Reputation: 77027
Here are two ways
1) Using set_index and repeat
In [1070]: df.set_index(['id', 'some_value'])['weight'].repeat(df['weight']).reset_index()
Out[1070]:
id some_value weight
0 1 2 5
1 1 2 5
2 1 2 5
3 1 2 5
4 1 2 5
5 2 A 2
6 2 A 2
7 3 B 1
8 4 3 3
9 4 3 3
10 4 3 3
2) Using .loc and .repeat
In [1071]: df.loc[df.index.repeat(df.weight)].reset_index(drop=True)
Out[1071]:
id some_value weight
0 1 2 5
1 1 2 5
2 1 2 5
3 1 2 5
4 1 2 5
5 2 A 2
6 2 A 2
7 3 B 1
8 4 3 3
9 4 3 3
10 4 3 3
Details
In [1072]: df
Out[1072]:
id some_value weight
0 1 2 5
1 2 A 2
2 3 B 1
3 4 3 3
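For anyone who wants to run approach 2 end to end, here is a minimal self-contained sketch. It rebuilds the sample frame from the question; the variable name expanded is mine, and it assumes the weights are non-negative integers (a weight of 0 would simply drop that row).

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "some_value": [2, "A", "B", 3],
    "weight": [5, 2, 1, 3],
})

# Repeat each index label `weight` times, select those labels with .loc,
# then renumber the rows so the resulting index is unique again.
expanded = df.loc[df.index.repeat(df["weight"])].reset_index(drop=True)
print(expanded)
```

The repetition itself happens on the index rather than in a Python-level loop over rows, which is the main reason this tends to scale better than building the rows one by one.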
Upvotes: 2
Reputation: 22031
Perhaps treat it like a weighted array:
def weighted_array(arr, weights):
    # Repeat each element of arr as many times as its paired weight.
    weighted_arr = []
    for value, weight in zip(arr, weights):
        for _ in range(weight):
            weighted_arr.append(value)
    return weighted_arr
The returned weighted_arr contains each element of arr, repeated as many times as the corresponding entry in weights.
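Since the question is about a huge data frame, the same idea can probably be expressed without the nested Python loops by using numpy.repeat. This is only a sketch: weighted_array_np is a hypothetical name, and the object dtype is there just because the sample column mixes integers and strings.

```python
import numpy as np

def weighted_array_np(arr, weights):
    """Hypothetical NumPy variant: repeat arr[i] exactly weights[i] times."""
    # dtype=object keeps mixed ints and strings as they are
    # instead of coercing everything to strings.
    values = np.asarray(arr, dtype=object)
    return np.repeat(values, weights)

result = weighted_array_np([2, "A", "B", 3], [5, 2, 1, 3])
print(result.tolist())
```

For a column with a single numeric dtype you would likely drop dtype=object so NumPy can use a compact native array.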
Upvotes: 0