Reputation: 27
I am trying to group together the groups that share a common element in a Pandas dataframe. My dataframe looks like this (I added the empty lines just for extra legibility):
ID REG
15 01497
15 01493
19 01706
19 01706-A
78 05710
78 01738
143 01626
143 01634
144 01626
144 01644
207 05255
207 01638
209 05255
209 03143
I want to create a new column where I can group all the IDs that share the same REG number, so it would look like this:
ID REG GROUP
15 01497 1
15 01493 1
19 01706 2
19 01706-A 2
78 05710 3
78 01738 3
143 01626 4
143 01634 4
144 01626 4
144 01644 4
207 05255 5
207 01638 5
209 05255 5
209 03143 5
There are some questions that address similar problems (perhaps Pandas DataFrame Groupby two columns and get counts, How do I find common values within groups in pandas?, or Numbering Groups In Pandas DataFrame), but they are not quite the same: I am not aiming to perform a sum or count in the traditional way.
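For anyone who wants to try the answers below, the sample can be reproduced like this (a sketch; REG is kept as strings so the leading zeros and the "01706-A" suffix survive intact):

```python
import pandas as pd

# Sample data from the question; REG stays str so "01497" keeps its leading zero
df = pd.DataFrame({
    "ID":  [15, 15, 19, 19, 78, 78, 143, 143, 144, 144, 207, 207, 209, 209],
    "REG": ["01497", "01493", "01706", "01706-A", "05710", "01738",
            "01626", "01634", "01626", "01644", "05255", "01638",
            "05255", "03143"],
})
print(df)
```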
Upvotes: 1
Views: 141
Reputation: 11321
With df your dataframe, you could try the following:
df["GROUP"] = (
df.groupby("ID", sort=False)["REG"].transform("first").to_frame()
.groupby("REG", sort=False).ngroup().add(1)
)
This groups over ID and picks the resp. first REG item for each ID-group. The second groupby then maps that REG column onto group numbers. Result for the sample in the question:
ID REG GROUP
0 15 01497 1
1 15 01493 1
2 19 01706 2
3 19 01706-A 2
4 78 05710 3
5 78 01738 3
6 143 01626 4
7 143 01634 4
8 144 01626 4
9 144 01644 4
10 207 05255 5
11 207 01638 5
12 209 05255 5
13 209 03143 5
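The same logic can be spelled out in two steps, which may be easier to follow (a sketch; `first_reg` is just a helper name I introduce here):

```python
import pandas as pd

df = pd.DataFrame({
    "ID":  [15, 15, 19, 19, 78, 78, 143, 143, 144, 144, 207, 207, 209, 209],
    "REG": ["01497", "01493", "01706", "01706-A", "05710", "01738",
            "01626", "01634", "01626", "01644", "05255", "01638",
            "05255", "03143"],
})

# Step 1: broadcast the first REG of each ID back onto every row of that ID
first_reg = df.groupby("ID", sort=False)["REG"].transform("first")

# Step 2: number the distinct first-REG values in order of appearance
# (ngroup starts at 0, hence the .add(1))
df["GROUP"] = first_reg.groupby(first_reg, sort=False).ngroup().add(1)
print(df)
```

Note this relies on the shared REG being the first one within each ID-group, as in the sample.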
Upvotes: 1
Reputation: 726
Assuming you have a dataframe df with ID as your index, try:
index_lookup = df.index.unique()
# x.name is the row's index label (the ID); get_loc returns its position among the unique IDs
df['GROUP'] = df.apply(lambda x: index_lookup.get_loc(x.name), axis=1)
print(df)
if ID is not the index, you can either make it the index using:
df.set_index('ID', inplace=True)
or, if you do not wish to change the index, use:
index_lookup = list(df['ID'].unique())
df['GROUP'] = df.apply(lambda x: index_lookup.index(x['ID']), axis=1)
That should work
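Note that this numbers rows purely by ID, so IDs that only share a REG (like 143 and 144 in the question) get different numbers. If that per-ID numbering is what you want, `pandas.factorize` does the same lookup without `apply` (a sketch on a small subset of the sample):

```python
import pandas as pd

df = pd.DataFrame({
    "ID":  [15, 15, 19, 19, 143, 144],
    "REG": ["01497", "01493", "01706", "01706-A", "01626", "01626"],
})

# factorize returns integer codes in order of first appearance, one code per distinct ID
df["GROUP"] = pd.factorize(df["ID"])[0] + 1
print(df)
```

Here 143 and 144 end up in groups 3 and 4 despite sharing REG 01626, which is the limitation of any purely ID-based numbering.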
Upvotes: 0
Reputation: 37737
One way is to use pandas.Series.duplicated with pandas.Series.cumsum:
df["GROUP"] = (~df["ID"].duplicated() ^ df["REG"].duplicated()).cumsum()
Output:
print(df)
ID REG GROUP
0 15 01497 1
1 15 01493 1
2 19 01706 2
3 19 01706-A 2
4 78 05710 3
5 78 01738 3
6 143 01626 4
7 143 01634 4
8 144 01626 4
9 144 01644 4
10 207 05255 5
11 207 01638 5
12 209 05255 5
13 209 03143 5
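To see why the XOR trick works, it helps to inspect the two intermediate masks (a sketch on a subset of the sample, which is ordered by ID as in the question): `~df["ID"].duplicated()` flags the first row of each ID, and `df["REG"].duplicated()` flags a REG already seen earlier. A new ID normally starts a new group, but when its REG was already seen, the XOR flips the flag back to False, so the running sum does not increase and the ID is merged into the earlier group.

```python
import pandas as pd

df = pd.DataFrame({
    "ID":  [15, 15, 19, 19, 143, 143, 144, 144],
    "REG": ["01497", "01493", "01706", "01706-A", "01626", "01634", "01626", "01644"],
})

new_id = ~df["ID"].duplicated()    # True on the first row of each ID
seen_reg = df["REG"].duplicated()  # True when this REG appeared earlier
# On 144's first row, new_id and seen_reg are both True, the XOR yields
# False, and cumsum keeps the group number from 143.
df["GROUP"] = (new_id ^ seen_reg).cumsum()
print(df[["ID", "REG", "GROUP"]])
```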
Upvotes: 1