Reputation: 87
I Have a data frame with the column name
and I need to create the column seq
, which allows me identify the different times that a name appears in the data frame, it's important to preserve the order.
import pandas as pd
data = {'name': ['Tom', 'Joseph','Joseph','Joseph', 'Tom', 'Tom', 'John','Tom','Tom','John','Joseph']
, 'seq': ['Tom 0', 'Joseph 0','Joseph 0','Joseph 0', 'Tom 1', 'Tom 1', 'John 0','Tom 2','Tom 2','John 1','Joseph 1']}
df = pd.DataFrame(data)
print(df)
name seq
0 Tom Tom 0
1 Joseph Joseph 0
2 Joseph Joseph 0
3 Joseph Joseph 0
4 Tom Tom 1
5 Tom Tom 1
6 John John 0
7 Tom Tom 2
8 Tom Tom 2
9 John John 1
10 Joseph Joseph 1
Upvotes: 2
Views: 508
Reputation: 30920
You need check for the existence of a new name and then create a new index for each name using groupby and cumsum, the resulting string Series can be concatenated with str.cat
df['seq'] = df['name'].str.cat(
df['name'].ne(df['name'].shift()).groupby(df['name']).cumsum().sub(1).astype(str),
sep=' '
)
Upvotes: 1
Reputation: 120391
Create a boolean mask to know if the name has changed from the previous row. Then filter out the second, third, ... names of a sequence before grouping by name. cumcount
increment the sequence number and finally concatenate name and sequence number.
# Boolean mask
m = df['name'].ne(df['name'].shift())
# Create sequence number
seq = df.loc[m].groupby('name').cumcount().astype(str) \
.reindex(df.index, fill_value=pd.NA).ffill()
# Concatenate name and seq
df['seq'] = df['name'] + ' ' + seq
Output:
>>> df
name seq
0 Tom Tom 0
1 Joseph Joseph 0
2 Joseph Joseph 0
3 Joseph Joseph 0
4 Tom Tom 1
5 Tom Tom 1
6 John John 0
7 Tom Tom 2
8 Tom Tom 2
9 John John 1
10 Joseph Joseph 1
>>> m
0 True
1 True
2 False
3 False
4 True
5 False
6 True
7 True
8 False
9 True
10 True
Name: name, dtype: bool
Upvotes: 2
Reputation: 93141
Assuming your data frame is indexes sequentiallly (0, 1, 2, 3, ...):
name
def sequencer(group):
idx = group.index.to_series()
# Every time the index has a gap >1, create a new island
return idx.diff().ne(1).cumsum().sub(1)
seq = df.groupby('name').apply(sequencer).droplevel(0).rename('seq')
df.merge(seq, left_index=True, right_index=True)
Upvotes: 0