Reputation: 529
I have a huge pandas
DataFrame df
, sorted by id
and then year
:
id gender year
3 male 1983
3 male 1983
3 male 1985
3 male 1990
6 female 1991
6 female 1992
7 male 1980
...
592873 female 1989
592873 female 1996
593001 male 2001
593428 female 2007
593428 female 2009
My goal is to create another column ca
which is computed by:
year
- minimum year
of that id
Hence, the first six rows of df
should return:
id gender year ca
3 male 1983 0
3 male 1983 0
3 male 1985 2
3 male 1990 7
6 female 1991 0
6 female 1992 1
(In other words, I'm searching for a Pythonic answer to this question.)
One solution I could think of is to make a list and use a for
loop:
ca_list = []
for i in range(len(df)):
if df['id'][i] != df['id'][i-1]:
num = df['year'][i]
ca_list.append(0)
else:
ca_list.append(df['year'][i] - num)
df['ca'] = ca_list
But I believe there is a more optimal way to devise this. Any insights are much appreciated.
Upvotes: 2
Views: 1413
Reputation: 195613
Try:
df["ca"] = df.groupby("id")["year"].transform(lambda x: x - x.min())
print(df)
Prints:
id gender year ca
0 3 male 1983 0
1 3 male 1983 0
2 3 male 1985 2
3 3 male 1990 7
4 6 female 1991 0
5 6 female 1992 1
Upvotes: 4