Reputation: 23
I have a pandas dataframe as below
Year | ID | Value |
---|---|---|
2016 | 1 | 100 |
2017 | 1 | 102 |
2017 | 1 | 105 |
2018 | 1 | 98 |
2016 | 2 | 121 |
2016 | 2 | 101 |
2016 | 2 | 133 |
2018 | 3 | 102 |
I want to check if the ID only exists for the year 2018 only. My desired output is as below:
Year | ID | Value | ID_only_in_2018 |
---|---|---|---|
2016 | 1 | 100 | 0 |
2017 | 1 | 102 | 0 |
2017 | 1 | 105 | 0 |
2018 | 1 | 98 | 0 |
2016 | 2 | 121 | 0 |
2016 | 2 | 101 | 0 |
2016 | 2 | 133 | 0 |
2018 | 3 | 102 | 1 |
Please, how can I achieve this in python?
Upvotes: 2
Views: 2509
Reputation: 11107
1st recreate the exampe:
import pandas as pd
data = [{"Year" : 2016, "ID" : 1, "Value" : 100},
{"Year" : 2017, "ID" : 1, "Value" : 102},
{"Year" : 2017, "ID" : 1, "Value" : 105},
{"Year" : 2018, "ID" : 1, "Value" : 98},
{"Year" : 2016, "ID" : 2, "Value" : 121},
{"Year" : 2016, "ID" : 2, "Value" : 101},
{"Year" : 2016, "ID" : 2, "Value" : 133},
{"Year" : 2018, "ID" : 3, "Value" : 102}]
df = pd.DataFrame(data)
And take a close look at the aggregated data to see what's what.
df.groupby("ID")['Year'].apply(list)
Which returns the following:
ID 1 [2016, 2017, 2017, 2018]
2 [2016, 2016, 2016]
3 [2018] Name: Year, dtype: object
So ID 1 appears in all 4 years, ID 2 appears 3 times but only ever with 2016, and ID 3 appears only once, for a single year, 2018.
Changing the aggregation function from list
to set
returns a slightly different view:
df.groupby("ID")['Year'].apply(set)
ID
1 {2016, 2017, 2018}
2 {2016}
3 {2018}
Name: Year, dtype: object
Which shows how ID 1 is spread across 3 years, and IDs 2 and 3 are spread against only a single year each.
If you want to test that a given ID then is associated with only one year, you could save and refer to one of these dictionaries, testing the values returned for a chosen year.
unique_lookup_set = dict(df.groupby("ID")['Year'].apply(set))
def contains_and_only_contains(value, collection):
if value in collection and len(set(collection))==1:
return 1
else:
return 0
Now try generating a series with the answers, to test it out:
df.apply(lambda x : contains_and_only_contains(2018, unique_lookup_set.get(x['ID'])), axis=1)
0 0
1 0
2 0
3 0
4 0
5 0
6 0
7 1
dtype: int
Good, and finally, include that series into the original dataframe to get the final output.
df['ID_only_in_2018']=df.apply(lambda x : contains_and_only_contains(2018, unique_lookup_set.get(x['ID'])), axis=1)
df
Year | ID | Value | ID_only_in_2018 | |
---|---|---|---|---|
0 | 2016 | 1 | 100 | 0 |
1 | 2017 | 1 | 102 | 0 |
2 | 2017 | 1 | 105 | 0 |
3 | 2018 | 1 | 98 | 0 |
4 | 2016 | 2 | 121 | 0 |
5 | 2016 | 2 | 101 | 0 |
6 | 2016 | 2 | 133 | 0 |
7 | 2018 | 3 | 102 | 1 |
Upvotes: 0
Reputation: 18315
For each group of ID
s, we can check if their unique Year
s are equal to [2018]
with isin
. Then we can map
all ID
s in the frame with that:
only_2018 = df.groupby("ID").Year.unique().isin([[2018]])
df["ID_only_in_2018"] = df.ID.map(only_2018).astype(int)
where the double brackets around 2018 is because unique
gives back a list so we put the other one and astype
is to convert True/False to 1/0.
to get
>>> df
Year ID Value ID_only_in_2018
0 2016 1 100 0
1 2017 1 102 0
2 2017 1 105 0
3 2018 1 98 0
4 2016 2 121 0
5 2016 2 101 0
6 2016 2 133 0
7 2018 3 102 1
only_2018
is:
ID
1 False
2 False
3 True
Upvotes: 0
Reputation: 120559
You have to test in your group ID
if year is unique and if this unique year is 2018:
df['ID_only_in_2018'] = df.groupby('ID')['Year'] \
.apply(lambda y: (y.nunique() == 1) &
(y == 2018)).astype(int)
>>> df
Year ID Value ID_only_in_2018
0 2016 1 100 0
1 2017 1 102 0
2 2017 1 105 0
3 2018 1 98 0
4 2016 2 121 0
5 2016 2 101 0
6 2016 2 133 0
7 2018 3 102 1
Upvotes: 0
Reputation: 863671
Compare Year for 2018
and then test if all values are only 2018
:
mask = df['Year'].eq(2018).groupby(df['ID']).transform('all')
Another idea is test if Year is not 2018
, filter ID
for not matched at least one non 2018
row and last invert mask by ~
for get only 2018
groups:
mask = ~df['ID'].isin(df.loc[df['Year'].ne(2018), 'ID'])
Last convert mask to integers:
df['ID_only_in_2018'] = mask.astype(int)
Or:
df['ID_only_in_2018'] = np.where(mask, 1, 0)
Or:
df['ID_only_in_2018'] = mask.view('i1')
print (df)
Year ID Value ID_only_in_2018
0 2016 1 100 0
1 2017 1 102 0
2 2017 1 105 0
3 2018 1 98 0
4 2016 2 121 0
5 2016 2 101 0
6 2016 2 133 0
7 2018 3 102 1
Upvotes: 2