Reputation: 23

How do I check if pandas df column value exists based on value in another column?

I have a pandas dataframe as below

Year	ID	Value
2016	1	100
2017	1	102
2017	1	105
2018	1	98
2016	2	121
2016	2	101
2016	2	133
2018	3	102

I want to check if the ID only exists for the year 2018 only. My desired output is as below:

Year	ID	Value	ID_only_in_2018
2016	1	100	0
2017	1	102	0
2017	1	105	0
2018	1	98	0
2016	2	121	0
2016	2	101	0
2016	2	133	0
2018	3	102	1

Please, how can I achieve this in python?

Upvotes: 2

Answers (4)

Thomas Kimber

Reputation: 11107

1st recreate the exampe:

import pandas as pd
data = [{"Year" : 2016, "ID" : 1, "Value" : 100},
       {"Year" : 2017, "ID" : 1, "Value" : 102},
       {"Year" : 2017, "ID" : 1, "Value" : 105},
       {"Year" : 2018, "ID" : 1, "Value" : 98},
       {"Year" : 2016, "ID" : 2, "Value" : 121},
       {"Year" : 2016, "ID" : 2, "Value" : 101},
       {"Year" : 2016, "ID" : 2, "Value" : 133},
       {"Year" : 2018, "ID" : 3, "Value" : 102}]

df = pd.DataFrame(data)

And take a close look at the aggregated data to see what's what.

df.groupby("ID")['Year'].apply(list)

Which returns the following:

ID 1    [2016, 2017, 2017, 2018] 
   2          [2016, 2016, 2016] 
   3                      [2018] Name: Year, dtype: object

So ID 1 appears in all 4 years, ID 2 appears 3 times but only ever with 2016, and ID 3 appears only once, for a single year, 2018.

Changing the aggregation function from list to set returns a slightly different view:

df.groupby("ID")['Year'].apply(set)

ID
1    {2016, 2017, 2018}
2                {2016}
3                {2018}
Name: Year, dtype: object

Which shows how ID 1 is spread across 3 years, and IDs 2 and 3 are spread against only a single year each.

If you want to test that a given ID then is associated with only one year, you could save and refer to one of these dictionaries, testing the values returned for a chosen year.

unique_lookup_set = dict(df.groupby("ID")['Year'].apply(set))

def contains_and_only_contains(value, collection):
    if value in collection and len(set(collection))==1:
         return 1
    else:
         return 0

Now try generating a series with the answers, to test it out:

df.apply(lambda x : contains_and_only_contains(2018, unique_lookup_set.get(x['ID'])), axis=1)

0    0
1    0
2    0
3    0
4    0
5    0
6    0
7     1
dtype: int

Good, and finally, include that series into the original dataframe to get the final output.

df['ID_only_in_2018']=df.apply(lambda x : contains_and_only_contains(2018, unique_lookup_set.get(x['ID'])), axis=1)

df

	Year	ID	Value	ID_only_in_2018
0	2016	1	100	0
1	2017	1	102	0
2	2017	1	105	0
3	2018	1	98	0
4	2016	2	121	0
5	2016	2	101	0
6	2016	2	133	0
7	2018	3	102	1

Upvotes: 0

Mustafa Aydın

Reputation: 18315

For each group of IDs, we can check if their unique Years are equal to [2018] with isin. Then we can map all IDs in the frame with that:

only_2018 = df.groupby("ID").Year.unique().isin([[2018]])
df["ID_only_in_2018"] = df.ID.map(only_2018).astype(int)

where the double brackets around 2018 is because unique gives back a list so we put the other one and astype is to convert True/False to 1/0.

to get

>>> df

   Year  ID  Value  ID_only_in_2018
0  2016   1    100                0
1  2017   1    102                0
2  2017   1    105                0
3  2018   1     98                0
4  2016   2    121                0
5  2016   2    101                0
6  2016   2    133                0
7  2018   3    102                1

only_2018 is:

ID
1    False
2    False
3     True

Upvotes: 0

Corralien

Reputation: 120559

You have to test in your group IDif year is unique and if this unique year is 2018:

df['ID_only_in_2018'] = df.groupby('ID')['Year'] \
                          .apply(lambda y: (y.nunique() == 1) &
                                (y == 2018)).astype(int)

>>> df
   Year  ID  Value  ID_only_in_2018
0  2016   1    100                0
1  2017   1    102                0
2  2017   1    105                0
3  2018   1     98                0
4  2016   2    121                0
5  2016   2    101                0
6  2016   2    133                0
7  2018   3    102                1

Upvotes: 0

jezrael

Reputation: 863671

Compare Year for 2018 and then test if all values are only 2018:

mask = df['Year'].eq(2018).groupby(df['ID']).transform('all')

Another idea is test if Year is not 2018, filter ID for not matched at least one non 2018 row and last invert mask by ~ for get only 2018 groups:

mask = ~df['ID'].isin(df.loc[df['Year'].ne(2018), 'ID'])

Last convert mask to integers:

df['ID_only_in_2018'] = mask.astype(int)

Or:

df['ID_only_in_2018'] = np.where(mask, 1, 0)

Or:

df['ID_only_in_2018'] = mask.view('i1')

print (df)
   Year  ID  Value  ID_only_in_2018
0  2016   1    100                0
1  2017   1    102                0
2  2017   1    105                0
3  2018   1     98                0
4  2016   2    121                0
5  2016   2    101                0
6  2016   2    133                0
7  2018   3    102                1

Upvotes: 2

How do I check if pandas df column value exists based on value in another column?

Answers (4)

Related Questions