Reputation: 715
I have two dataframes df and df1. where I have to match sequence or strings and getting the only matching string sequence with index number of df as output.
df
idx id_0 user string
0 008457 02 hello
1 990037 05 I
2 774426 10 am
3 564389 08 sleeping
4 009124 17 today
5 000029 13 is
6 548751 21 a
7 479903 19 bright
8 897054 08 sunny
9 336588 7 day
10 294260 16 today
11 908751 29 is
12 558902 81 rainy
13 097856 19 with
14 110044 24 cold
15 775098 16 today
16 665490 02 is
17 887099 07 sunday
18 389011 18 ahhh
19 675510 11 weekend
df1
idx string
0 today
1 is
2 a
3 bright
4 sunny
5 day
output:
idx id_0 user string
4 009124 17 today
5 000029 13 is
6 548751 21 a
7 479903 19 bright
8 897054 08 sunny
9 336588 7 day
I tried with several methods pd.merge, pd.concat, pd.join, also with isin, but, I get the wrong index number.
e.g.,
out = df1[df1['string'].isin(df.index().['string'])]
Upvotes: 2
Views: 99
Reputation: 385
A possible way to do that is as follows:
df = pd.DataFrame([
[0, "008457", "02", "hello"],
[1, "990037", "05", "I"],
[2, "774426", "10", "am"],
[3, "564389", "08", "sleeping"],
[4, "009124", "17", "today"],
[5, "000029", "13", "is"],
[6, "548751", "21", "a"],
[7, "479903", "19", "bright"],
[8, "897054", "08", "sunny"],
[9, "336588", "7", "day"],
[10, "294260", "16", "today"],
[11, "908751", "29", "is"],
[12, "558902", "81", "rainy"],
[13, "097856", "19", "with"],
[14, "110044", "24", "cold"],
[15, "775098", "16", "today"],
[16, "665490", "02", "is"],
[17, "887099", "07", "sunday"],
[18, "389011", "18", "ahhh"],
[19, "675510", "11", "weekend"]
],
columns=["idx", "id_0", "user", "string"]
)
df = df.set_index('idx')
df1 = pd.DataFrame([
[0, "today"],
[1, "is"],
[2, "a"],
[3, "bright"],
[4, "sunny"],
[5, "day"]
],
columns=["idx", "string"]
)
matching_indices = []
for i in range(len(df)-len(df1)+1):
if (df.string.iloc[i:i+len(df1)].values == df1.string.values).all():
matching_indices += list(range(i,i+len(df1)))
df.iloc[matching_indices]
With output:
id_0 user string
idx
4 009124 17 today
5 000029 13 is
6 548751 21 a
7 479903 19 bright
8 897054 08 sunny
9 336588 7 day
The above code will return all the matching subsequences with their correct indices, not only the first occurrence.
If you wish to return only the first occurrence you can break the loop the first time a match is identified, like below:
matching_indices = []
for i in range(len(df)-len(df1)+1):
if (df.string.iloc[i:i+len(df1)].values == df1.string.values).all():
matching_indices += list(range(i,i+len(df1)))
break
df.iloc[matching_indices]
Upvotes: 2
Reputation: 3961
Convert df1.strings to list, and compare to df with apply and a lambda function:
import pandas as pd
df = pd.DataFrame([
[0, "008457", "02", "hello"],
[1, "990037", "05", "I"],
[2, "774426", "10", "am"],
[3, "564389", "08", "sleeping"],
[4, "009124", "17", "today"],
[5, "000029", "13", "is"],
[6, "548751", "21", "a"],
[7, "479903", "19", "bright"],
[8, "897054", "08", "sunny"],
[9, "336588", "7", "day"],
[10, "294260", "16", "today"],
[11, "908751", "29", "is"],
[12, "558902", "81", "rainy"],
[13, "097856", "19", "with"],
[14, "110044", "24", "cold"],
[15, "775098", "16", "today"],
[16, "665490", "02", "is"],
[17, "887099", "07", "sunday"],
[18, "389011", "18", "ahhh"],
[19, "675510", "11", "weekend"]
],
columns=["idx", "id_0", "user", "string"]
)
df1 = pd.DataFrame([
[0, "today"],
[1, "is"],
[2, "a"],
[3, "bright"],
[4, "sunny"],
[5, "day"]
],
columns=["idx", "string"]
)
string_list = df1.string.tolist()
filt = df['string'].apply(lambda x: any([k in x for k in string_list]))
print(df[filt])
Returning:
idx id_0 user string
2 2 774426 10 am
4 4 009124 17 today
5 5 000029 13 is
6 6 548751 21 a
7 7 479903 19 bright
8 8 897054 08 sunny
9 9 336588 7 day
10 10 294260 16 today
11 11 908751 29 is
12 12 558902 81 rainy
15 15 775098 16 today
16 16 665490 02 is
17 17 887099 07 sunday
18 18 389011 18 ahhh
Upvotes: 0