Emmy

Reputation: 23

How do I check if pandas df column value exists based on value in another column?

I have a pandas dataframe as below

Year ID Value
2016 1 100
2017 1 102
2017 1 105
2018 1 98
2016 2 121
2016 2 101
2016 2 133
2018 3 102

I want to check whether an ID exists only for the year 2018. My desired output is below:

Year ID Value ID_only_in_2018
2016 1 100 0
2017 1 102 0
2017 1 105 0
2018 1 98 0
2016 2 121 0
2016 2 101 0
2016 2 133 0
2018 3 102 1

Please, how can I achieve this in python?

Upvotes: 2

Views: 2509

Answers (4)

Thomas Kimber

Reputation: 11107

First, recreate the example:

import pandas as pd

data = [{"Year": 2016, "ID": 1, "Value": 100},
        {"Year": 2017, "ID": 1, "Value": 102},
        {"Year": 2017, "ID": 1, "Value": 105},
        {"Year": 2018, "ID": 1, "Value": 98},
        {"Year": 2016, "ID": 2, "Value": 121},
        {"Year": 2016, "ID": 2, "Value": 101},
        {"Year": 2016, "ID": 2, "Value": 133},
        {"Year": 2018, "ID": 3, "Value": 102}]

df = pd.DataFrame(data)

And take a close look at the aggregated data to see what's what.

df.groupby("ID")['Year'].apply(list)

Which returns the following:

ID
1    [2016, 2017, 2017, 2018]
2          [2016, 2016, 2016]
3                      [2018]
Name: Year, dtype: object

So ID 1 appears four times across three different years, ID 2 appears three times but only ever in 2016, and ID 3 appears only once, in 2018.

Changing the aggregation function from list to set returns a slightly different view:

df.groupby("ID")['Year'].apply(set)

ID
1    {2016, 2017, 2018}
2                {2016}
3                {2018}
Name: Year, dtype: object

Which shows how ID 1 is spread across 3 years, while IDs 2 and 3 each cover only a single year.

If you want to test that a given ID is associated with only one year, you can save this mapping as a dictionary and check the set of years returned for each ID against a chosen year.

unique_lookup_set = dict(df.groupby("ID")['Year'].apply(set))

def contains_and_only_contains(value, collection):
    if value in collection and len(set(collection)) == 1:
        return 1
    else:
        return 0
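As a quick sanity check, the helper can be exercised on a few hand-made collections (a standalone copy of the function is included so the snippet runs on its own):

```python
# Standalone copy of the helper for a quick sanity check.
def contains_and_only_contains(value, collection):
    if value in collection and len(set(collection)) == 1:
        return 1
    else:
        return 0

assert contains_and_only_contains(2018, {2018}) == 1        # only 2018
assert contains_and_only_contains(2018, {2016, 2018}) == 0  # mixed years
assert contains_and_only_contains(2018, {2016}) == 0        # 2018 absent
```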

Now try generating a series with the answers, to test it out:

df.apply(lambda x : contains_and_only_contains(2018, unique_lookup_set.get(x['ID'])), axis=1)

0    0
1    0
2    0
3    0
4    0
5    0
6    0
7    1
dtype: int64

Good. Finally, assign that series to a new column in the original dataframe to get the final output.

df['ID_only_in_2018']=df.apply(lambda x : contains_and_only_contains(2018, unique_lookup_set.get(x['ID'])), axis=1)

df
Year ID Value ID_only_in_2018
0 2016 1 100 0
1 2017 1 102 0
2 2017 1 105 0
3 2018 1 98 0
4 2016 2 121 0
5 2016 2 101 0
6 2016 2 133 0
7 2018 3 102 1

Upvotes: 0

Mustafa Aydın

Reputation: 18315

For each ID group, we can check with isin whether its unique Years equal [2018], then map every ID in the frame through that result:

only_2018 = df.groupby("ID").Year.unique().isin([[2018]])
df["ID_only_in_2018"] = df.ID.map(only_2018).astype(int)

The double brackets around 2018 are needed because unique gives back a list of years per group, so we wrap [2018] in another list to compare against; astype converts True/False to 1/0.

to get

>>> df

   Year  ID  Value  ID_only_in_2018
0  2016   1    100                0
1  2017   1    102                0
2  2017   1    105                0
3  2018   1     98                0
4  2016   2    121                0
5  2016   2    101                0
6  2016   2    133                0
7  2018   3    102                1

only_2018 is:

ID
1    False
2    False
3     True
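If comparing list-valued entries feels fragile, an equivalent variation (a sketch, not the answer's exact code) reduces each group's years to a set and compares it against {2018}:

```python
import pandas as pd

# Recreate the question's sample frame.
df = pd.DataFrame({
    "Year":  [2016, 2017, 2017, 2018, 2016, 2016, 2016, 2018],
    "ID":    [1, 1, 1, 1, 2, 2, 2, 3],
    "Value": [100, 102, 105, 98, 121, 101, 133, 102],
})

# Per ID, reduce the years to a set and ask if that set is exactly {2018}.
only_2018 = df.groupby("ID")["Year"].apply(lambda y: set(y) == {2018})
df["ID_only_in_2018"] = df["ID"].map(only_2018).astype(int)
print(df["ID_only_in_2018"].tolist())  # [0, 0, 0, 0, 0, 0, 0, 1]
```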

Upvotes: 0

Corralien

Reputation: 120559

You have to test, within each ID group, whether the year is unique and whether that unique year is 2018:

df['ID_only_in_2018'] = df.groupby('ID')['Year'] \
                          .apply(lambda y: (y.nunique() == 1) &
                                           (y == 2018)).astype(int)
>>> df
   Year  ID  Value  ID_only_in_2018
0  2016   1    100                0
1  2017   1    102                0
2  2017   1    105                0
3  2018   1     98                0
4  2016   2    121                0
5  2016   2    101                0
6  2016   2    133                0
7  2018   3    102                1

Upvotes: 0

jezrael

Reputation: 863671

Compare Year with 2018 and then, per ID group, test whether all values are 2018:

mask = df['Year'].eq(2018).groupby(df['ID']).transform('all')

Another idea is to test whether Year is not 2018, collect the IDs that match at least one non-2018 row, and finally invert the mask with ~ to keep only the all-2018 groups:

mask = ~df['ID'].isin(df.loc[df['Year'].ne(2018), 'ID'])

Last convert mask to integers:

df['ID_only_in_2018'] = mask.astype(int)

Or (this one additionally needs import numpy as np):

df['ID_only_in_2018'] = np.where(mask, 1, 0)

Or:

df['ID_only_in_2018'] = mask.view('i1')

print (df)
   Year  ID  Value  ID_only_in_2018
0  2016   1    100                0
1  2017   1    102                0
2  2017   1    105                0
3  2018   1     98                0
4  2016   2    121                0
5  2016   2    101                0
6  2016   2    133                0
7  2018   3    102                1
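For completeness, here is a minimal self-contained run of the first mask approach, recreating the sample frame from the question:

```python
import pandas as pd

# Recreate the question's sample frame.
df = pd.DataFrame({
    "Year":  [2016, 2017, 2017, 2018, 2016, 2016, 2016, 2018],
    "ID":    [1, 1, 1, 1, 2, 2, 2, 3],
    "Value": [100, 102, 105, 98, 121, 101, 133, 102],
})

# True where the row's Year is 2018; transform('all') then requires it
# to hold for every row belonging to that ID.
mask = df["Year"].eq(2018).groupby(df["ID"]).transform("all")
df["ID_only_in_2018"] = mask.astype(int)
print(df["ID_only_in_2018"].tolist())  # [0, 0, 0, 0, 0, 0, 0, 1]
```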

Upvotes: 2
