Reputation: 91
I have a dataframe that the first column are strings, the 2nd column are the numbers that I want to replicate each string.
df = pd.DataFrame({'String':['a','b', 'c', 'd', 'e', 'f', 'g'],
'Times':[2, 3, 5, 3, 2, 4, 5]})
df
String | Times |
---|---|
a | 2 |
b | 3 |
c | 5 |
d | 3 |
e | 2 |
f | 4 |
g | 5 |
How can I create a data frame into this in python? (stopped at d but you know the pattern). Thanks!
String | Times |
---|---|
a | 1 |
a | 2 |
b | 1 |
b | 2 |
b | 3 |
c | 1 |
c | 2 |
c | 3 |
c | 4 |
c | 5 |
Upvotes: 1
Views: 1208
Reputation: 765
ss1=df.String.mul(df.Times).map(list).explode("")
ss2=ss1.groupby(ss1).cumcount().add(1)
pd.DataFrame(dict(String=ss1,Times=ss2))
Output:
String Times
0 a 1
1 a 2
2 b 1
3 b 2
4 b 3
5 c 1
6 c 2
7 c 3
8 c 4
9 c 5
10 d 1
11 d 2
12 d 3
13 e 1
14 e 2
15 f 1
16 f 2
17 f 3
18 f 4
19 g 1
20 g 2
21 g 3
22 g 4
23 g 5
Upvotes: 0
Reputation: 260640
Use Index.repeat
to replicate the rows and groupby
+cumcount
: to enumerate them:
(df.loc[df.index.repeat(df['Times'])]
.assign(Times=lambda d: d.groupby('String').cumcount().add(1))
)
Output:
String Times
0 a 1
0 a 2
1 b 1
1 b 2
1 b 3
2 c 1
2 c 2
2 c 3
2 c 4
2 c 5
3 d 1
...
@mozway
11.2 ms ± 1.63 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
@Onyambu
16 ms ± 685 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
@ifly6
27.3 ms ± 2.42 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
@ddejohn
28.5 ms ± 4.2 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
On 700k rows: 1s, 1.43s, 2.34s, 2.94s
in the same order
Upvotes: 2
Reputation: 79228
Another method could be:
df.assign(Times = df.Times.apply(lambda x: range(1, x + 1))).explode('Times')
Out[]:
String Times
0 a 1
0 a 2
1 b 1
1 b 2
1 b 3
2 c 1
2 c 2
2 c 3
2 c 4
2 c 5
Upvotes: 3
Reputation: 8962
A bit messy, but seems to do what you want:
new_df = pd.DataFrame(df.String.str.repeat(df.Times).apply(list).explode().reset_index(drop=True))
new_df["Times"] = new_df.groupby("String").cumcount() + 1
Output:
String Times
0 a 1
1 a 2
2 b 1
3 b 2
4 b 3
5 c 1
6 c 2
7 c 3
8 c 4
9 c 5
10 d 1
11 d 2
12 d 3
13 e 1
14 e 2
15 f 1
16 f 2
17 f 3
18 f 4
19 g 1
20 g 2
21 g 3
22 g 4
23 g 5
Upvotes: 3
Reputation: 5331
Use df.explode
. Make this work by converting your repetitions to list:
s = ['a','b', 'c', 'd', 'e', 'f', 'g']
t = [2, 3, 5, 3, 2, 4, 5]
Then,
times = [list(range(i + 1)) for i in t]
df = pd.DataFrame({'string': s, 'times': times}).explode('times')
>>> df.head(10) string times 0 a 0 0 a 1 0 a 2 1 b 0 1 b 1 1 b 2 1 b 3 2 c 0 2 c 1 2 c 2
If you don't want the index repeated like that, df.reset_index(drop=True, inplace=True)
(if in place, otherwise, reassign). If you want 1-indexing rather than 0-indexing, for list(range(i + 1))
substitute list(range(1, i + 2))
. Alternatively, just add one to the times
column at the end. The times
column doesn't really matter for df.explode
, it explodes based on repetitions anyway.
Upvotes: 1