Reputation: 4842
I am new to pandas.
I have a dataset which looks like this:
Date_1 Hour_1 id_1 Date_2 Hour_2 id_2 Date_3 Hour_3 id_3
2019-12-04 00 ABC 2019-12-04 01 ABC 2019-12-04 02 ABC
2019-12-04 00 ABCD 2019-12-04 01 ABCD 2019-12-04 02 ABCD
2019-12-04 00 ABCDEF 2019-12-04 01 ABCDE 2019-12-04 02 ABCDEF
2019-12-04 03 ABCDEFG 2019-12-04 01 ABCDEFG 2019-12-04 02 ABCDEF
...
My goal
Is to check if id_1
exists in id_2
, id_3
. And create a new dataframe where the structure would be the following:
Date_1 Hour_1 id_1 Date_2 Hour_2 Exists Date_3 Hour_3 Exists
2019-12-04 00 ABC 2019-12-04 01 True 2019-12-04 02 True
2019-12-04 00 ABCD 2019-12-04 01 True 2019-12-04 02 True
2019-12-04 00 ABCDEF False 2019-12-04 02 True
2019-12-04 03 ABCDEFG 2019-12-04 01 True False
The problem which I have right now is that I don't know how to include Date_2, Hour_2, Date_3, Hour_3 or exclude them depending if the id_2 and id_3 is True or False.
When I am creating my dataframe I simply add all each source of the information ( Date, Hour, id ) and I get large dataframe where I have Date_1-10, Hour_1-10, id_1-10.
final_export['Exists in id_2'] = final_data['id_1'].isin(final_data['id_2'])
final_export['Date from id_2'] = final_data['Date from id_2 other source']
final_export['Hour from id_2'] = final_data['Hour from id_2 other source']
When I use .isin()
method it filters the data correctly but it does not change if the hour and date from the same row is included or not. For example if id_1 exists in id_3 I would have True and its date and hour, if it does not exists I would have False and date with hour would be empty.
At the moment when I use .isin()
date and hour are not linked to the id_ value.
Let me know if the problem is explained correctly.
Thank you for your suggestions.
Upvotes: 1
Views: 116
Reputation: 3280
Something like that should work:
mask_id2 = df.id_1 == df.id_2
mask_id3 = df.id_1 == df.id_3
df.id_2 = mask_id2
df.id_3 = mask_id3
df.loc[~mask_id2, ['Date_2', 'Hour_2']] = ""
df.loc[~mask_id3, ['Date_3', 'Hour_3']] = ""
Output:
Date_1 Hour_1 id_1 Date_2 Hour_2 id_2 Date_3 Hour_3 id_3
0 2019-12-04 0 ABC 2019-12-04 1 True 2019-12-04 2 True
1 2019-12-04 0 ABCD 2019-12-04 1 True 2019-12-04 2 True
2 2019-12-04 0 ABCDEF False 2019-12-04 2 True
3 2019-12-04 3 ABCDEFG 2019-12-04 1 True False
Upvotes: 2
Reputation: 5728
If Iron Hand's answer isn't what you were after this will give you a df in the format you had -
import pandas as pd
final_data = pd.DataFrame({
"Date_1": ["2019-12-04", "2019-12-04", "2019-12-04", "2019-12-04"],"Hour_1": ["00", "00", "00", "03"],"id_1": ["ABC", "ABCD", "ABCDEF", "ABCDEFG"],
"Date_2": ["2019-12-04", "2019-12-04", "2019-12-04", "2019-12-04"],"Hour_2": ["01", "01", "01", "01"],"id_2": ["ABC", "ABCD", "ABCDE", "ABCDEFG"],
"Date_3": ["2019-12-04", "2019-12-04", "2019-12-04", "2019-12-04"],"Hour_3": ["02", "02", "02", "02"],"id_3": ["ABC", "ABCD", "ABCDEF", "ABCDEF"],
})
final_data['Exists in id_2'] = final_data['id_1'].isin(final_data['id_2'])
final_data['Exists in id_3'] = final_data['id_1'].isin(final_data['id_3']) final_data['Date_2']=final_data.apply(lambda r: r['Date_2'] if r['Exists in id_2'] is True else '',axis=1)
final_data['Hour_2']=final_data.apply(lambda r: r['Hour_2'] if r['Exists in id_2'] is True else '',axis=1)
final_data['Date_2']=final_data.apply(lambda r: r['Date_2'] if r['Exists in id_2'] is True else '',axis=1)
final_data['Date_3']=final_data.apply(lambda r: r['Date_3'] if r['Exists in id_3'] is True else '',axis=1)
final_data['Hour_3']=final_data.apply(lambda r: r['Hour_3'] if r['Exists in id_3'] is True else '',axis=1)
print(final_data[['id_1','id_2','id_3','Hour_2','Hour_3']])
which gives a df with all the original info except id2, hour2 removed when id_2 is not in id1 and likewise for id3. Selected rows look like -
id_1 id_2 id_3 Hour_2 Date_2 Hour_3 Date_3
0 ABC ABC ABC 01 2019-12-04 02 2019-12-04
1 ABCD ABCD ABCD 01 2019-12-04 02 2019-12-04
2 ABCDEF ABCDE ABCDEF 02 2019-12-04
3 ABCDEFG ABCDEFG ABCDEF 01 2019-12-04
Upvotes: 0
Reputation: 430
try
df = pd.DataFrame({
"Date_1": ["2019-12-04", "2019-12-04", "2019-12-04", "2019-12-04"],
"Hour_1": ["00", "00", "00", "03"],
"id_1": ["ABC", "ABCD", "ABCDEF", "ABCDEFG"],
"Date_2": ["2019-12-04", "2019-12-04", "2019-12-04", "2019-12-04"],
"Hour_2": ["01", "01", "01", "01"],
"id_2": ["ABC", "ABCD", "ABCDE", "ABCDEFG"],
"Date_3": ["2019-12-04", "2019-12-04", "2019-12-04", "2019-12-04"],
"Hour_3": ["02", "02", "02", "02"],
"id_3": ["ABC", "ABCD", "ABCDEF", "ABCDEF"],
})
ids = df["id_1"]
# You can choose whichever columns you want
df_1 = df.loc[df["id_1"].isin(ids), ["Date_1", "Hour_1", "id_1"]]
df_2 = df.loc[df["id_2"].isin(ids), ["Date_2", "Hour_2", "id_2"]]
df_3 = df.loc[df["id_3"].isin(ids), ["Date_3", "Hour_3", "id_3"]]
df_concat = pd.concat([df_1, df_2, df_3], axis=1)
output
Date_1 Hour_1 id_1 Date_2 Hour_2 id_2 Date_3 Hour_3 id_3
0 2019-12-04 00 ABC 2019-12-04 01 ABC 2019-12-04 02 ABC
1 2019-12-04 00 ABCD 2019-12-04 01 ABCD 2019-12-04 02 ABCD
2 2019-12-04 00 ABCDEF NaN NaN NaN 2019-12-04 02 ABCDEF
3 2019-12-04 03 ABCDEFG 2019-12-04 01 ABCDEFG 2019-12-04 02 ABCDEF
Upvotes: 0
Reputation: 1574
If I understand your problem correctly, isin()
is the wrong function to use: it checks to see if a value of id_1
is anywhere in the id_2
or (id_3
) column: it doesn't check if id_1
is a substring of the value of id_2
from the same row. Try the following code:
import pandas as pd
testdf = pd.DataFrame({
"hour_1": ["00", "01"],
"id_1":["ABC", "ABC"],
"id_2":["ABCD", "AB"],
})
testdf["exists_in_2"] = testdf['id_1'].isin(testdf['id_2'])
testdf
To fix that bit first:
eltwise_contains = lambda frag, text: frag in text
testdf["exists_in_2"] = testdf[['id_1', 'id_2']].apply(lambda x : eltwise_contains(*x), axis = 1)
testdf
Next, your problem: setting days and hours to the empty string if id_1
doesn't exist in the values of id_2
and id_3
of the same row. We can use the same pattern as above: define a lambda expression that takes two inputs, then in the next line, extract two columns from the DataFrame, and apply a further lambda on that sub-DataFrame that passes an unpacked lambda-variable to the original lambda.
empty_string_if_false = lambda a_bool, val: val if a_bool else ""
testdf["hour_1"] = testdf[['exists_in_2', 'hour_1']].apply(lambda x : empty_string_if_false(*x), axis = 1)
testdf
Upvotes: 0
Reputation: 9
I would suggest dividing the data frame into three data frames, each having id,date,hour and use merge function to merge the dataframes with the id as a value and assign null values where the id is not present
Upvotes: 0