TYL
TYL

Reputation: 9

Count number of row that contain two exact strings

This is my df1

df1 = pd.DataFrame(
    [
        ["apple,orange,milk"],
        ["orange,watermelon,apple"],
        ["milk,banana,apple"]
    ], 
    columns=['fruits']
)

df1

0 apple,orange,milk
1 orange,watermelon,apple
2 milk,banana,apple

This is my df2

df2 = pd.DataFrame(["apple","orange","banana"], columns=['fruits'])

df2

0 apple
1 orange
2 banana

I want to find number of row that the two exact strings occur together. For example, count number of row when apple and milk occur together in the row This is my code

for i,row in df2.iterrows():
    for j,rows in df1.iterrows():
        b = (rows.str.contains('(?:\s|\S|[,;])milk(?:\s|\S|[,;])') & rows.str.contains('(?:\s|\S|[,;])+df2.iloc[i]+(?:\s|\S|[,;])')).sum()
        if b>0:
            c=c+1
    print(c)

The output I got from here is always 0

0
0
0

The output supposed to be:

2
1
1

Upvotes: 0

Views: 100

Answers (2)

b-fg
b-fg

Reputation: 4137

First, as posted by @ifly6, your question needs to fix the dataframe creation.

Secondly, I assume (differently from the other answer) that you want to find how many contiguous strings as defined in df2 appear for each df1 row. A solution can be to first create the possible contiguous strings from df2 and then iterate through df1 to see if there is any match and how many words does the match contains. For example,

import pandas as pd
import itertools

def contiguous_indices(xs):
    n = len(xs)
    indices = list(range(n+1))
    for i,j in itertools.combinations(indices,2):
        yield xs[i:j]

df1=pd.DataFrame(["apple,orange,milk","orange,watermelon,apple","milk,banana,apple"])
df2=pd.DataFrame(["apple","orange","banana"])

# Define the list of possible contiguous strings in df2
s_list = []
for indx_list in contiguous_indices(range(df2[0].size)):
    s = ''
    for indx in indx_list:
        s += df2[0][indx] + ','
    s_list.append(s[:-1])
print(s_list) 
# ['apple', 'apple,orange', 'apple,orange,banana', 'orange', 'orange,banana', 'banana']

# Iterate through df1 and count max number of contiguous strings matches
for i, s1 in df1.iterrows():
    c_max = 0
    s_save = ''
    for s in s_list:
        if s in s1[0] and len(s.split(',')) > c_max:
            c_max = len(s.split(','))
            s_save = s
    print(i, c_max, s_save)

The output will be:

0 2 apple,orange
1 1 apple
2 1 apple

Upvotes: 0

ifly6
ifly6

Reputation: 5331

First, your constructor for the DataFrame doesn't work because it's misspelt and because it provides the wrong input. Correcting to:

df1 = pd.DataFrame(["apple,orange,milk", "orange,watermelon,apple", "milk,banana,apple"])
df2 = pd.DataFrame(["apple", "orange", "banana"])

Second, your question is unclear. If I were to rephrase it, I would say something along the lines of: 'I want to find the count of times that a two search terms appear in the same cell, over a set of search terms'. I'm not 100pc that is more clear, however. That said...


Create a function which uses string contains taking two specific parameters (and the necessary items for identifying where it ought to search):

def find2(df, col, s1, s2):
    return sum(df[col].str.contains(s1) & df[col].str.contains(s2))

What this does is that it asks across the column, whether any row contains search term s1 or s2. Then it intersects the two and sums the number of results. Executing:

df2[0].apply(lambda i: find2(df1, 0, 'milk', i))
Out[10]: 
0    2
1    1
2    1
Name: 0, dtype: int64

Upvotes: 2

Related Questions