singularity2047

Reputation: 1071

Create new variable for grouped data using python

I have a data frame like this:

import pandas
d = {'name': ['john', 'john', 'john', 'Tim', 'Tim', 'Tim','Bob', 'Bob'], 'Prod': ['101', '102', '101', '501', '505', '301', '302', '302'],'Qty': ['5', '4', '1', '3', '5', '4', '1', '3']}
df = pandas.DataFrame(data=d)


I want to create a new id variable. When a name (say john) appears for the first time, id should equal 1; for every other occurrence of the same name, id should be 0. This should be done for every name in the data. How do I go about doing that?

Final output should be like this:

[image: expected output]

NOTE: For anyone who knows SAS: there you can sort the data by name and then use first.name.

       if first.variable = 1 then id = 1

For first occurrence of same name first.name = 1. For any other repeat occurrence of same name, first.name = 0. I am trying to replicate the same in python.

So far I have tried pandas groupby with first() and also numpy.where(), but couldn't make either of them work. Any fresh perspective will be appreciated.

Upvotes: 0

Views: 139

Answers (2)

BENY

Reputation: 323326

You can use cumcount. Grouping by both Prod and name flags the first row of each (Prod, name) pair:

s=df.groupby(['Prod','name']).cumcount().add(1)
df['counter']=s.mask(s.gt(1),0)
df
Out[1417]: 
  Prod Qty  name  counter
0  101   5  john        1
1  102   4  john        1
2  101   1  john        0
3  501   3   Tim        1
4  505   5   Tim        1
5  301   4   Tim        1
6  302   1   Bob        1
7  302   3   Bob        0

Update: to flag only the first occurrence of each name, group by name alone:

s=df.groupby(['name']).cumcount().add(1).le(1).astype(int)
s
Out[1421]: 
0    1
1    0
2    0
3    1
4    0
5    0
6    1
7    0
dtype: int32
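
As a self-contained sketch of the update above, the flag can be written straight into a column. The column name id is taken from the question, and cumcount().eq(0) is used here as an equivalent of .add(1).le(1); treat it as one possible way to write it, not the only one.

```python
import pandas as pd

d = {'name': ['john', 'john', 'john', 'Tim', 'Tim', 'Tim', 'Bob', 'Bob'],
     'Prod': ['101', '102', '101', '501', '505', '301', '302', '302'],
     'Qty': ['5', '4', '1', '3', '5', '4', '1', '3']}
df = pd.DataFrame(data=d)

# cumcount() numbers the rows inside each name group starting at 0,
# so the first occurrence of every name is the row numbered 0.
df['id'] = df.groupby('name').cumcount().eq(0).astype(int)
print(df)
```

Because cumcount follows the existing row order, no sort by name is needed first, unlike the SAS first.name approach described in the question.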

A faster option:

df.loc[df.name.drop_duplicates().index,'counter']=1
df.fillna(0)
Out[1430]: 
  Prod Qty  name  counter
0  101   5  john      1.0
1  102   4  john      0.0
2  101   1  john      0.0
3  501   3   Tim      1.0
4  505   5   Tim      0.0
5  301   4   Tim      0.0
6  302   1   Bob      1.0
7  302   3   Bob      0.0
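
Note that the drop_duplicates snippet above leaves counter as floats (the unset rows start as NaN) and that fillna returns a new frame rather than changing df. A minimal sketch that avoids both, assuming Series.duplicated is an acceptable substitute for drop_duplicates here:

```python
import pandas as pd

d = {'name': ['john', 'john', 'john', 'Tim', 'Tim', 'Tim', 'Bob', 'Bob'],
     'Prod': ['101', '102', '101', '501', '505', '301', '302', '302'],
     'Qty': ['5', '4', '1', '3', '5', '4', '1', '3']}
df = pd.DataFrame(data=d)

# duplicated() is False on the first occurrence of each name and True on repeats,
# so negating it and casting to int gives 1 for first rows and 0 otherwise.
df['counter'] = (~df['name'].duplicated()).astype(int)
print(df['counter'].tolist())  # [1, 0, 0, 1, 0, 0, 1, 0]
```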

Upvotes: 3

Primusa

Reputation: 13498

We can just work directly with your dictionary d and loop through to create a new entry.

import pandas

d = {'name': ['john', 'john', 'john', 'Tim', 'Tim', 'Tim','Bob', 'Bob'], 'Prod': ['101', '102', '101', '501', '505', '301', '302', '302'],'Qty': ['5', '4', '1', '3', '5', '4', '1', '3']}
names = set()  # names that have already appeared
ids = []  # called ids so the built-in id() is not shadowed
for name in d['name']:
    if name in names:  # seen before: add 0
        ids.append(0)
    else:  # first appearance: add 1 and record the name
        ids.append(1)
        names.add(name)
d['id'] = ids  # add the new entry to the dictionary
df = pandas.DataFrame(data=d)
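
If the same flagging is needed in more than one place, the loop above could be wrapped in a small helper. This is only a sketch; the function name first_occurrence_flags is made up for illustration and is not part of pandas.

```python
def first_occurrence_flags(values):
    """Return 1 the first time each value appears and 0 for repeats."""
    seen = set()  # values encountered so far
    flags = []
    for value in values:
        flags.append(0 if value in seen else 1)
        seen.add(value)
    return flags


names = ['john', 'john', 'john', 'Tim', 'Tim', 'Tim', 'Bob', 'Bob']
flags = first_occurrence_flags(names)
print(flags)  # [1, 0, 0, 1, 0, 0, 1, 0]
```

The result can then be assigned as d['id'] = flags before building the data frame, exactly as in the loop version.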

Upvotes: 1
