Jonas Palačionis
Jonas Palačionis

Reputation: 4842

Adding value from a pandas dataframe only if other value from the same row is True

I am new to pandas.

I have a dataset which looks like this:

Date_1       Hour_1    id_1    Date_2       Hour_2    id_2    Date_3       Hour_3    id_3    
2019-12-04   00        ABC     2019-12-04   01        ABC     2019-12-04   02        ABC
2019-12-04   00        ABCD    2019-12-04   01        ABCD    2019-12-04   02        ABCD
2019-12-04   00        ABCDEF  2019-12-04   01        ABCDE   2019-12-04   02        ABCDEF
2019-12-04   03        ABCDEFG 2019-12-04   01        ABCDEFG 2019-12-04   02        ABCDEF
...

My goal

Is to check if id_1 exists in id_2, id_3. And create a new dataframe where the structure would be the following:

Date_1       Hour_1    id_1    Date_2       Hour_2    Exists   Date_3       Hour_3    Exists    
2019-12-04   00        ABC     2019-12-04   01        True     2019-12-04   02        True
2019-12-04   00        ABCD    2019-12-04   01        True     2019-12-04   02        True
2019-12-04   00        ABCDEF                         False    2019-12-04   02        True
2019-12-04   03        ABCDEFG 2019-12-04   01        True                            False

The problem which I have right now is that I don't know how to include Date_2, Hour_2, Date_3, Hour_3 or exclude them depending if the id_2 and id_3 is True or False.

When I am creating my dataframe I simply add all each source of the information ( Date, Hour, id ) and I get large dataframe where I have Date_1-10, Hour_1-10, id_1-10.

final_export['Exists in id_2'] = final_data['id_1'].isin(final_data['id_2'])
final_export['Date from id_2'] = final_data['Date from id_2 other source']
final_export['Hour from id_2'] = final_data['Hour from id_2 other source']

When I use .isin() method it filters the data correctly but it does not change if the hour and date from the same row is included or not. For example if id_1 exists in id_3 I would have True and its date and hour, if it does not exists I would have False and date with hour would be empty.

At the moment when I use .isin() date and hour are not linked to the id_ value.

Let me know if the problem is explained correctly.

Thank you for your suggestions.

Upvotes: 1

Views: 116

Answers (5)

fsl
fsl

Reputation: 3280

Something like that should work:

mask_id2 = df.id_1 == df.id_2
mask_id3 = df.id_1 == df.id_3

df.id_2 = mask_id2
df.id_3 = mask_id3

df.loc[~mask_id2, ['Date_2', 'Hour_2']] = ""
df.loc[~mask_id3, ['Date_3', 'Hour_3']] = ""

Output:

       Date_1  Hour_1     id_1      Date_2 Hour_2   id_2      Date_3 Hour_3   id_3
0  2019-12-04       0      ABC  2019-12-04      1   True  2019-12-04      2   True
1  2019-12-04       0     ABCD  2019-12-04      1   True  2019-12-04      2   True
2  2019-12-04       0   ABCDEF                     False  2019-12-04      2   True
3  2019-12-04       3  ABCDEFG  2019-12-04      1   True                     False

Upvotes: 2

jeremy_rutman
jeremy_rutman

Reputation: 5728

If Iron Hand's answer isn't what you were after this will give you a df in the format you had -

import pandas as pd

final_data = pd.DataFrame({
    "Date_1": ["2019-12-04", "2019-12-04", "2019-12-04", "2019-12-04"],"Hour_1": ["00", "00", "00", "03"],"id_1": ["ABC", "ABCD", "ABCDEF", "ABCDEFG"],
    "Date_2": ["2019-12-04", "2019-12-04", "2019-12-04", "2019-12-04"],"Hour_2": ["01", "01", "01", "01"],"id_2": ["ABC", "ABCD", "ABCDE", "ABCDEFG"],
    "Date_3": ["2019-12-04", "2019-12-04", "2019-12-04", "2019-12-04"],"Hour_3": ["02", "02", "02", "02"],"id_3": ["ABC", "ABCD", "ABCDEF", "ABCDEF"],
})

final_data['Exists in id_2'] = final_data['id_1'].isin(final_data['id_2'])
final_data['Exists in id_3'] = final_data['id_1'].isin(final_data['id_3'])    final_data['Date_2']=final_data.apply(lambda r: r['Date_2'] if r['Exists in id_2'] is True else '',axis=1)
final_data['Hour_2']=final_data.apply(lambda r: r['Hour_2'] if r['Exists in id_2'] is True else '',axis=1)
final_data['Date_2']=final_data.apply(lambda r: r['Date_2'] if r['Exists in id_2'] is True else '',axis=1)
final_data['Date_3']=final_data.apply(lambda r: r['Date_3'] if r['Exists in id_3'] is True else '',axis=1)
final_data['Hour_3']=final_data.apply(lambda r: r['Hour_3'] if r['Exists in id_3'] is True else '',axis=1)
print(final_data[['id_1','id_2','id_3','Hour_2','Hour_3']])

which gives a df with all the original info except id2, hour2 removed when id_2 is not in id1 and likewise for id3. Selected rows look like -

      id_1     id_2    id_3 Hour_2      Date_2 Hour_3      Date_3
0      ABC      ABC     ABC     01  2019-12-04     02  2019-12-04
1     ABCD     ABCD    ABCD     01  2019-12-04     02  2019-12-04
2   ABCDEF    ABCDE  ABCDEF                        02  2019-12-04
3  ABCDEFG  ABCDEFG  ABCDEF     01  2019-12-04              

Upvotes: 0

Iron Hand Odin
Iron Hand Odin

Reputation: 430

try

df = pd.DataFrame({
    "Date_1": ["2019-12-04", "2019-12-04", "2019-12-04", "2019-12-04"],
    "Hour_1": ["00", "00", "00", "03"],
    "id_1": ["ABC", "ABCD", "ABCDEF", "ABCDEFG"],
    "Date_2": ["2019-12-04", "2019-12-04", "2019-12-04", "2019-12-04"],
    "Hour_2": ["01", "01", "01", "01"],
    "id_2": ["ABC", "ABCD", "ABCDE", "ABCDEFG"],
    "Date_3": ["2019-12-04", "2019-12-04", "2019-12-04", "2019-12-04"],
    "Hour_3": ["02", "02", "02", "02"],
    "id_3": ["ABC", "ABCD", "ABCDEF", "ABCDEF"],
})

ids = df["id_1"]
# You can choose whichever columns you want
df_1 = df.loc[df["id_1"].isin(ids), ["Date_1", "Hour_1", "id_1"]]
df_2 = df.loc[df["id_2"].isin(ids), ["Date_2", "Hour_2", "id_2"]]
df_3 = df.loc[df["id_3"].isin(ids), ["Date_3", "Hour_3", "id_3"]]

df_concat = pd.concat([df_1, df_2, df_3], axis=1)

output

Date_1  Hour_1  id_1    Date_2  Hour_2  id_2    Date_3  Hour_3  id_3
0   2019-12-04  00  ABC 2019-12-04  01  ABC 2019-12-04  02  ABC
1   2019-12-04  00  ABCD    2019-12-04  01  ABCD    2019-12-04  02  ABCD
2   2019-12-04  00  ABCDEF  NaN NaN NaN 2019-12-04  02  ABCDEF
3   2019-12-04  03  ABCDEFG 2019-12-04  01  ABCDEFG 2019-12-04  02  ABCDEF

Upvotes: 0

butterflyknife
butterflyknife

Reputation: 1574

If I understand your problem correctly, isin() is the wrong function to use: it checks to see if a value of id_1 is anywhere in the id_2 or (id_3) column: it doesn't check if id_1 is a substring of the value of id_2 from the same row. Try the following code:

import pandas as pd
testdf = pd.DataFrame({
    "hour_1": ["00", "01"],
    "id_1":["ABC", "ABC"], 
    "id_2":["ABCD", "AB"], 
})
testdf["exists_in_2"] = testdf['id_1'].isin(testdf['id_2'])
testdf

To fix that bit first:

eltwise_contains =  lambda frag, text: frag in text
testdf["exists_in_2"] = testdf[['id_1', 'id_2']].apply(lambda x : eltwise_contains(*x), axis = 1)

testdf

Next, your problem: setting days and hours to the empty string if id_1 doesn't exist in the values of id_2 and id_3 of the same row. We can use the same pattern as above: define a lambda expression that takes two inputs, then in the next line, extract two columns from the DataFrame, and apply a further lambda on that sub-DataFrame that passes an unpacked lambda-variable to the original lambda.

empty_string_if_false = lambda a_bool, val: val if a_bool else ""
testdf["hour_1"] = testdf[['exists_in_2', 'hour_1']].apply(lambda x : empty_string_if_false(*x), axis = 1)

testdf

Upvotes: 0

I would suggest dividing the data frame into three data frames, each having id,date,hour and use merge function to merge the dataframes with the id as a value and assign null values where the id is not present

Upvotes: 0

Related Questions