fulatoro

Reputation: 765

Pandas DataFrame Groupby to get Unique row condition and identify with increasing value up to Number of Groups

I have a DataFrame in which the combination of values in columns A, B and C identifies a unique address. I would like to give every row an identifier that is the same for all rows sharing an address and that increases by one for each new address.

For example

A B C D E
0 1 1 0 1
0 1 2 0 1
0 1 1 1 1
0 1 3 0 1
0 1 2 1 0
0 1 1 2 1

I would like to generate the following

A B C D E  ID
0 1 1 0 1  0
0 1 2 0 1  1
0 1 1 1 1  0
0 1 3 0 1  2
0 1 2 1 0  1
0 1 1 2 1  0

I tried the following:

id = 0
def set_id(df):
    global id
    df['ID'] = id
    id += 1


df.groupby(['A','B','C']).transform(set_id)

This returns a NULL DataFrame, so it is clearly not the right way to do it. I am new to pandas. I suspect the solution should use df[['A','B','C']].drop_duplicates() to get all the unique combinations. Thank you.

Upvotes: 2

Views: 1538

Answers (2)

Alex Fung

Reputation: 2006

I think this is what you need:

import pandas as pd

df2 = df[['A', 'B', 'C']].drop_duplicates()  # unique (A, B, C) combinations, in order of first appearance
df2 = df2.reset_index(drop=True).reset_index()  # renumber from 0, then turn that numbering into a column named 'index'
df2 = df2.rename(columns={'index': 'ID'})  # rename 'index' to 'ID'
df = pd.merge(df, df2, on=['A', 'B', 'C'], how='left')  # attach the ID column to every original row
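To check the approach above against the example in the question, here is a minimal self-contained sketch. The names keys, lookup and result are only illustrative, and the sample data is copied from the question.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "A": [0, 0, 0, 0, 0, 0],
    "B": [1, 1, 1, 1, 1, 1],
    "C": [1, 2, 1, 3, 2, 1],
    "D": [0, 0, 1, 0, 1, 2],
    "E": [1, 1, 1, 1, 0, 1],
})

keys = ["A", "B", "C"]

# One row per unique address, numbered 0..n-1 in order of first appearance.
lookup = df[keys].drop_duplicates().reset_index(drop=True)
lookup["ID"] = range(len(lookup))

# A left merge keeps the original row order and attaches the matching ID.
result = df.merge(lookup, on=keys, how="left")

print(result)
```

The ID column of result should match the one requested in the question.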

Upvotes: 1

Alexander

Reputation: 109756

# Create tuple triplet using values from columns A, B & C.
df['key'] = [triplet for triplet in zip(*[df[col].values.tolist() for col in ['A', 'B', 'C']])]

# Sort dataframe on new `key` column.
df.sort_values('key', inplace=True)

# Flag each row whose key differs from the previous row's key; the running
# total of those flags, minus 1, is a zero-based ID (in sorted key order).
df['ID'] = (df['key'] != df['key'].shift()).cumsum() - 1

# Clean up.
del df['key']
df.sort_index(inplace=True)

>>> df
   A  B  C  D  E  ID
0  0  1  1  0  1   0
1  0  1  2  0  1   1
2  0  1  1  1  1   0
3  0  1  3  0  1   2
4  0  1  2  1  0   1
5  0  1  1  2  1   0

Upvotes: 0
