Following on from this question, I want a count of specific words or phrases per row in a DataFrame, where the strings are extremely similar and one of the strings to be searched is a substring of the others yet must be counted as its own unique item. When I expanded and diversified the criteria of the strings to be searched, I ran into trouble and was not able to cleanly modify or generalize the original solution. The set of strings I'm looking for is as follows:
Set of Strings
'burgers',
'burgers.extreme',
'burgers legacy',
'burgers 2',
'burgers V2',
'greasy burgers',
'ultimate burgers',
'chewy burgers',
'rotten burgers'
Expected Output
Additional columns appended to the DataFrame, one per search string, with the strings as the new column names, showing the count of each of those strings per row.
Data: DataFrame column with "Text" as the column name:
Text
burgers V2 satisfies these requirements 1; burgers is sufficient for requirements 2; burgers legacy is on the road map for requirement 3; burgers answers requirements 4
burgers satisfies requirement 1; burgers is on the roadmap for requirement 2; burgers V2 satisfies requirement 3; burgers legacy satisfies requirement 4
burgers legacy satisfies requirements 1; burgers V2 satisfies requirements 2; burgers V2 satisfies requirements 3
greasy burgers do not meet requirements 1; burgers satisfies requirements 2; burgers.extreme meets requirement 3; ultimate burgers satisfies requirements 4
chewy burgers satisfies requirements 1; rotten burgers satisfies requirements 2; burgers legacy satisfies requirement 3; burgers legacy satisfies requirements 4
burgers 2 satisfies requirements 1; chewy burgers satisfies requirements 2; greasy burgers satisfies requirements 3; greasy burgers satisfies requirements 4
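For anyone who wants to run the snippets below as-is, here is a minimal reconstruction of that column as a DataFrame:
import pandas as pd

# Reconstruction of the "Text" column above, for reproducing the examples.
df = pd.DataFrame({
    "Text": [
        "burgers V2 satisfies these requirements 1; burgers is sufficient for requirements 2; burgers legacy is on the road map for requirement 3; burgers answers requirements 4",
        "burgers satisfies requirement 1; burgers is on the roadmap for requirement 2; burgers V2 satisfies requirement 3; burgers legacy satisfies requirement 4",
        "burgers legacy satisfies requirements 1; burgers V2 satisfies requirements 2; burgers V2 satisfies requirements 3",
        "greasy burgers do not meet requirements 1; burgers satisfies requirements 2; burgers.extreme meets requirement 3; ultimate burgers satisfies requirements 4",
        "chewy burgers satisfies requirements 1; rotten burgers satisfies requirements 2; burgers legacy satisfies requirement 3; burgers legacy satisfies requirements 4",
        "burgers 2 satisfies requirements 1; chewy burgers satisfies requirements 2; greasy burgers satisfies requirements 3; greasy burgers satisfies requirements 4",
    ]
})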
Issues to Note
df['burgers'] = df['Text'].str.count('burgers')
results in over-counting 'burgers'. If an occurrence of 'burgers' is not immediately preceded by 'greasy ', 'ultimate ', 'chewy ', or 'rotten ', and not immediately followed by '.extreme', ' legacy', ' 2', or ' V2', then it should be considered and counted as its own unique item in that row. Note that '.extreme' (indicating the item 'burgers.extreme') does not have a space like the others.
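To make the over-count concrete, here is what the naive count returns on the first row (a quick sketch using the DataFrame constructed above):
df["Text"].str.count("burgers").iloc[0]
# 4 -- every occurrence of the substring is counted, but only 2 of the
# 4 hits are the bare item 'burgers'; the other 2 belong to
# 'burgers V2' and 'burgers legacy'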
What I attempted
Using Andrej's wonderful answer as a starting point, I attempted to modify it for a broader category of strings, but have been running into trouble with the regex and generalizing it, such as trying:
tmp = df["Text"].str.findall(r"(greasy|ultimate|chewy|rotten)?\s*burgers\s*(legacy|V2|2)?").explode()
df = pd.concat([df, pd.crosstab(tmp.index, tmp).add_prefix("burgers ")], axis=1)
df.columns = map(str.strip, df.columns)
but adding (greasy|ultimate|chewy|rotten)?\s* returns TypeError: Cannot broadcast np.ndarray with operand of type <class 'list'>, nor does it make sense to label the columns with .add_prefix("burgers ") since there is now text being matched on either side of 'burgers'. It also omits 'burgers.extreme' entirely.
I thought that perhaps doing it piece-wise might be the way to go, something like
tmp1 = df["Text"].str.findall(r"burgers\s*(legacy|V2|2)?").explode()
tmp2 = df["Text"].str.findall(r"(greasy|ultimate|chewy|rotten)?\s*burgers").explode()
tmp3 = df["Text"].str.findall(r"burgers.extreme").explode()
but I'm not able to merge or concatenate the three in a way that allows the crosstab to work; it returns a dimensionality error. It also seems like a very inefficient approach.
What can I do to resolve this? Or what generalized approach can I take that allows searching and counting strings per row when one of the search strings is a substring of the others?
Answer
Regular expression "or" statements (alternation with |) will short-circuit: the first alternative that matches at a given position wins. Thus, as long as you put burgers last, you will not experience double matching.
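To see that short-circuit concretely with plain re (a minimal sketch, independent of pandas):
import re

re.findall(r"burgers legacy|burgers", "burgers legacy satisfies requirement 4")
# ['burgers legacy'] -- the longer alternative is tried first and wins

re.findall(r"burgers|burgers legacy", "burgers legacy satisfies requirement 4")
# ['burgers'] -- with burgers first, the match stops before ' legacy'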
We can do this as follows:
targets = [
"burgers.extreme",
"burgers legacy",
"burgers 2",
"burgers V2",
"greasy burgers",
"ultimate burgers",
"chewy burgers",
"rotten burgers",
"burgers",
]
pattern = "|".join(f"({item})" for item in targets)
# (burgers.extreme)|(burgers legacy)|(burgers 2)|(burgers V2)|(greasy burgers)|(ultimate burgers)|(chewy burgers)|(rotten burgers)|(burgers)
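One caveat worth flagging (my note, not part of the original answer): the dot in burgers.extreme is a regex metacharacter, so the pattern above would also match strings like 'burgersXextreme'. If that matters for your data, escaping each target keeps every character literal:
import re

# same pattern, but with each target escaped so '.' matches only a literal dot
pattern = "|".join(f"({re.escape(item)})" for item in targets)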
and we can build up our answer as follows:
df[targets] = (
df["Text"]
.str.extractall(pattern)
.droplevel(axis=0, level=1)
.reset_index()
.groupby("index")
.count()
)
Which gives:
Text burgers.extreme burgers legacy burgers 2 burgers V2 greasy burgers ultimate burgers chewy burgers rotten burgers burgers
0 burgers V2 satisfies these requirements 1; bur... 0 1 0 1 0 0 0 0 2
1 burgers satisfies requirement 1; burgers is o... 0 1 0 1 0 0 0 0 2
2 burgers legacy satisfies requirements 1; burge... 0 1 0 2 0 0 0 0 0
3 greasy burgers do not meet requirements 1; bur... 1 0 0 0 1 1 0 0 1
4 chewy burgers satisfies requirements 1; rotten... 0 2 0 0 0 0 1 1 0
5 burgers 2 satisfies requirements 1; chewy burg... 0 0 1 0 2 0 1 0 0
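One edge case to be aware of (my addition): a row containing none of the targets drops out of the groupby, leaving NaN in its count columns after assignment. Every row above happens to match something, but on broader data a fill keeps the counts as integers:
# only needed if some rows can contain none of the targets
df[targets] = df[targets].fillna(0).astype(int)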
So what's going on here:
- extractall adds a new row for each match, with a multi-index where the first level represents the index in the original df and the second represents the index of the match. Each such row is all NaN except for the matching value.
- We don't need the match-number level; we just care about what row the match was in in the original df, so we drop it using .droplevel(axis=0, level=1)
- Before the groupby, we reset the index so that we can group on the column index, which is the index from the original dataframe.

And that gives you your result.
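If it helps to see those intermediate pieces, inspecting the pipeline stage by stage shows where the index column the groupby relies on comes from (a small sketch):
tmp = df["Text"].str.extractall(pattern)
print(tmp.index.names)  # [None, 'match'] -- original row, then match number

# dropping 'match' and resetting turns the unnamed row index into a
# column literally called 'index', which is what groupby("index") uses
flat = tmp.droplevel(axis=0, level=1).reset_index()
print(flat.columns[0])  # 'index'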
Regex also will find the earliest matches. As an example, this means that regardless of the ordering of your targets, if your text contains greasy burgers V2, the regular expression will find greasy burgers, not burgers and not burgers V2.
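Checking that with the pattern built above (a quick sketch):
import re

re.findall(pattern, "greasy burgers V2")
# [('', '', '', '', 'greasy burgers', '', '', '', '')]
# the scan starts at position 0, where 'greasy burgers' matches first;
# 'burgers V2' begins later in the string and is never reached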