Reputation: 451
I have a dataframe df:
df = pd.DataFrame({
'id': ['1', '1', '2', '3', '3', '8','4', '1', '2', '4'],
'start': ['2017-01-02', '2017-02-01', '2017-03-01', '2017-02-01', '2017-03-01', '2017-04-01', '2017-01-01', '2017-04-01', '2017-05-01', '2017-02-01']
})
df.sort_values(['id', 'start'])
I wanted to see if for every id, there existss a second line. if it exist, i wanted to add a new column (lets say num_count) where I want to add integer 1. If there exists only one row per id or the last row of the id, I want to add 0. Here is what I want as output.
id num_count start
0 1 1 2017-01-02
1 1 1 2017-02-01
7 1 0 2017-04-01
2 2 1 2017-03-01
8 2 0 2017-05-01
3 3 1 2017-02-01
4 3 0 2017-03-01
6 4 1 2017-01-01
9 4 0 2017-02-01
5 8 0 2017-04-01
How can I do that?
Upvotes: 1
Views: 39
Reputation: 476659
You will need to assign the result of .sort_values(..)
given you want to process the sorted dataframe:
df = df.sort_values(['id', 'start'])
You can make use of the duplicated(..)
function [pandas-doc] here where, based on your sample output, you want the last item of the duplicated data to be False
.
So we can assign a column like:
df['num_count'] = df['id'].duplicated('last').astype(int)
This then yields for the sample input:
>>> df
id start
0 1 2017-01-02
1 1 2017-02-01
7 1 2017-04-01
2 2 2017-03-01
8 2 2017-05-01
3 3 2017-02-01
4 3 2017-03-01
6 4 2017-01-01
9 4 2017-02-01
5 8 2017-04-01
>>> df['num_count'] = df['id'].duplicated('last').astype(int)
>>> df
id start num_count
0 1 2017-01-02 1
1 1 2017-02-01 1
7 1 2017-04-01 0
2 2 2017-03-01 1
8 2 2017-05-01 0
3 3 2017-02-01 1
4 3 2017-03-01 0
6 4 2017-01-01 1
9 4 2017-02-01 0
5 8 2017-04-01 0
Upvotes: 3