Matching sequence of two dataframes with similar string parttern keeping index and sequence

Question

I have two dataframes df and df1. where I have to match sequence or strings and getting the only matching string sequence with index number of df as output.

df

idx id_0 user string
0 008457 02 hello
1 990037 05 I 
2 774426 10 am
3 564389 08 sleeping
4 009124 17 today
5 000029 13 is
6 548751 21 a
7 479903 19 bright
8 897054 08 sunny
9 336588 7 day
10 294260 16 today 
11 908751 29 is 
12 558902 81 rainy
13 097856 19 with
14 110044 24 cold
15 775098 16 today 
16 665490 02 is
17 887099 07 sunday 
18 389011 18 ahhh
19 675510 11 weekend

df1

idx string
0 today
1 is
2 a
3 bright
4 sunny
5 day

output:

idx id_0 user string
4 009124 17 today
5 000029 13 is
6 548751 21 a
7 479903 19 bright
8 897054 08 sunny
9 336588 7 day

I tried with several methods pd.merge, pd.concat, pd.join, also with isin, but, I get the wrong index number.

e.g.,

out = df1[df1['string'].isin(df.index().['string'])]

sns · Accepted Answer

A possible way to do that is as follows:

df = pd.DataFrame([
    [0, "008457", "02", "hello"],
    [1, "990037", "05", "I"],
    [2, "774426", "10", "am"],
    [3, "564389", "08", "sleeping"],
    [4, "009124", "17", "today"],
    [5, "000029", "13", "is"],
    [6, "548751", "21", "a"],
    [7, "479903", "19", "bright"],
    [8, "897054", "08", "sunny"],
    [9, "336588", "7", "day"],
    [10, "294260", "16", "today"],
    [11, "908751", "29", "is"],
    [12, "558902", "81", "rainy"],
    [13, "097856", "19", "with"],
    [14, "110044", "24", "cold"],
    [15, "775098", "16", "today"],
    [16, "665490", "02", "is"],
    [17, "887099", "07", "sunday"],
    [18, "389011", "18", "ahhh"],
    [19, "675510", "11", "weekend"]
],
columns=["idx", "id_0", "user", "string"]
)
df = df.set_index('idx')

df1 = pd.DataFrame([
[0, "today"],
[1, "is"],
[2, "a"],
[3, "bright"],
[4, "sunny"],
[5, "day"]
],
columns=["idx", "string"]
)


matching_indices = []
for i in range(len(df)-len(df1)+1):
    if (df.string.iloc[i:i+len(df1)].values == df1.string.values).all():
        matching_indices += list(range(i,i+len(df1)))

df.iloc[matching_indices]

With output:

    id_0    user    string
idx         
4   009124  17  today
5   000029  13  is
6   548751  21  a
7   479903  19  bright
8   897054  08  sunny
9   336588  7   day

The above code will return all the matching subsequences with their correct indices, not only the first occurrence.

If you wish to return only the first occurrence you can break the loop the first time a match is identified, like below:

matching_indices = []
for i in range(len(df)-len(df1)+1):
    if (df.string.iloc[i:i+len(df1)].values == df1.string.values).all():
        matching_indices += list(range(i,i+len(df1)))
        break

df.iloc[matching_indices]

Matching sequence of two dataframes with similar string parttern keeping index and sequence

Answers (2)

Related Questions