Following on from this question, I want a count of specific words or phrases per row in a DataFrame, where the strings are extremely similar and one of the strings to be searched is a substring of the others yet must be counted as its own unique item. When I expanded and diversified the criteria of the strings to be searched, I ran into trouble and was not able to cleanly modify or generalize the original solution. The set of strings I'm looking for is as follows:
Set of Strings
'burgers',
'burgers.extreme',
'burgers legacy',
'burgers 2',
'burgers V2',
'greasy burgers',
'ultimate burgers',
'chewy burgers',
'rotten burgers'
Expected Output
Additional columns appended to the DataFrame, one per search string, with the strings as the new column names, showing the count of each of those strings per row.
Data: DataFrame column with "Text" as the column name:
Text
burgers V2 satisfies these requirements 1; burgers is sufficient for requirements 2; burgers legacy is on the road map for requirement 3; burgers answers requirements 4
burgers satisfies requirement 1; burgers is on the roadmap for requirement 2; burgers V2 satisfies requirement 3; burgers legacy satisfies requirement 4
burgers legacy satisfies requirements 1; burgers V2 satisfies requirements 2; burgers V2 satisfies requirements 3
greasy burgers do not meet requirements 1; burgers satisfies requirements 2; burgers.extreme meets requirement 3; ultimate burgers satisfies requirements 4
chewy burgers satisfies requirements 1; rotten burgers satisfies requirements 2; burgers legacy satisfies requirement 3; burgers legacy satisfies requirements 4
burgers 2 satisfies requirements 1; chewy burgers satisfies requirements 2; greasy burgers satisfies requirements 3; greasy burgers satisfies requirements 4
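For anyone who wants to run the snippets below as-is, here is a minimal reconstruction of that column as a DataFrame:
import pandas as pd

# Reconstruction of the "Text" column above, for reproducing the examples.
df = pd.DataFrame({
    "Text": [
        "burgers V2 satisfies these requirements 1; burgers is sufficient for requirements 2; burgers legacy is on the road map for requirement 3; burgers answers requirements 4",
        "burgers satisfies requirement 1; burgers is on the roadmap for requirement 2; burgers V2 satisfies requirement 3; burgers legacy satisfies requirement 4",
        "burgers legacy satisfies requirements 1; burgers V2 satisfies requirements 2; burgers V2 satisfies requirements 3",
        "greasy burgers do not meet requirements 1; burgers satisfies requirements 2; burgers.extreme meets requirement 3; ultimate burgers satisfies requirements 4",
        "chewy burgers satisfies requirements 1; rotten burgers satisfies requirements 2; burgers legacy satisfies requirement 3; burgers legacy satisfies requirements 4",
        "burgers 2 satisfies requirements 1; chewy burgers satisfies requirements 2; greasy burgers satisfies requirements 3; greasy burgers satisfies requirements 4",
    ]
})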
Issues to Note
df['burgers'] = df['Text'].str.count('burgers')
results in over-counting 'burgers'. If an occurrence of 'burgers' is not immediately preceded by 'greasy ', 'ultimate ', 'chewy ', or 'rotten ', and not immediately followed by '.extreme', ' legacy', ' 2', or ' V2', then it should be considered and counted as its own unique item in that row. Note that '.extreme' (indicating the item 'burgers.extreme') does not have a space like the others.
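To make the over-count concrete, here is what the naive count returns on the first row (a quick sketch using the DataFrame constructed above):
df["Text"].str.count("burgers").iloc[0]
# 4 -- every occurrence of the substring is counted, but only 2 of the
# 4 hits are the bare item 'burgers'; the other 2 belong to
# 'burgers V2' and 'burgers legacy'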
What I attempted
Using Andrej's wonderful answer as a starting point, I attempted to modify it for a broader category of strings, but have been running into trouble with the regex and generalizing it, such as trying:
tmp = df["Text"].str.findall(r"(greasy|ultimate|chewy|rotten)?\s*burgers\s*(legacy|V2|2)?").explode()
df = pd.concat([df, pd.crosstab(tmp.index, tmp).add_prefix("burgers ")], axis=1)
df.columns = map(str.strip, df.columns)
but adding (greasy|ultimate|chewy|rotten)?\s* returns TypeError: Cannot broadcast np.ndarray with operand of type <class 'list'>, nor does it make sense to label the columns with .add_prefix("burgers ") since there is now text being matched on either side of 'burgers'. It also omits 'burgers.extreme' entirely.
I thought that perhaps doing it piece-wise might be the way to go, something like
tmp1 = df["Text"].str.findall(r"burgers\s*(legacy|V2|2)?").explode()
tmp2 = df["Text"].str.findall(r"(greasy|ultimate|chewy|rotten)?\s*burgers").explode()
tmp3 = df["Text"].str.findall(r"burgers.extreme").explode()
but I'm not able to merge or concatenate the three in a way that allows the crosstab to work; it returns a dimensionality error. It also seems like a very inefficient approach.
What can I do to resolve this? Or what generalized approach can I take that allows searching and counting strings per row when one of the search strings is a substring of the others?
Answer
Regular expression "or" statements (alternation with |) will short-circuit: the first alternative that matches at a given position wins. Thus, as long as you put burgers last, you will not experience double matching.
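To see that short-circuit concretely with plain re (a minimal sketch, independent of pandas):
import re

re.findall(r"burgers legacy|burgers", "burgers legacy satisfies requirement 4")
# ['burgers legacy'] -- the longer alternative is tried first and wins

re.findall(r"burgers|burgers legacy", "burgers legacy satisfies requirement 4")
# ['burgers'] -- with burgers first, the match stops before ' legacy'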
We can do this as follows:
targets = [
"burgers.extreme",
"burgers legacy",
"burgers 2",
"burgers V2",
"greasy burgers",
"ultimate burgers",
"chewy burgers",
"rotten burgers",
"burgers",
]
pattern = "|".join(f"({item})" for item in targets)
# (burgers.extreme)|(burgers legacy)|(burgers 2)|(burgers V2)|(greasy burgers)|(ultimate burgers)|(chewy burgers)|(rotten burgers)|(burgers)
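One caveat worth flagging (my note, not part of the original answer): the dot in burgers.extreme is a regex metacharacter, so the pattern above would also match strings like 'burgersXextreme'. If that matters for your data, escaping each target keeps every character literal:
import re

# same pattern, but with each target escaped so '.' matches only a literal dot
pattern = "|".join(f"({re.escape(item)})" for item in targets)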
and we can build up our answer as follows:
df[targets] = (
df["Text"]
.str.extractall(pattern)
.droplevel(axis=0, level=1)
.reset_index()
.groupby("index")
.count()
)
Which gives:
Text burgers.extreme burgers legacy burgers 2 burgers V2 greasy burgers ultimate burgers chewy burgers rotten burgers burgers
0 burgers V2 satisfies these requirements 1; bur... 0 1 0 1 0 0 0 0 2
1 burgers satisfies requirement 1; burgers is o... 0 1 0 1 0 0 0 0 2
2 burgers legacy satisfies requirements 1; burge... 0 1 0 2 0 0 0 0 0
3 greasy burgers do not meet requirements 1; bur... 1 0 0 0 1 1 0 0 1
4 chewy burgers satisfies requirements 1; rotten... 0 2 0 0 0 0 1 1 0
5 burgers 2 satisfies requirements 1; chewy burg... 0 0 1 0 2 0 1 0 0
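One edge case to be aware of (my addition): a row containing none of the targets drops out of the groupby, leaving NaN in its count columns after assignment. Every row above happens to match something, but on broader data a fill keeps the counts as integers:
# only needed if some rows can contain none of the targets
df[targets] = df[targets].fillna(0).astype(int)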
So what's going on here:
- extractall adds a new row for each match, with a multi-index where the first level represents the index in the original df and the second represents the index of the match. Each such row is all NaN except for the matching value.
- We don't need the match-number level; we just care about what row the match was in in the original df, so we drop it using .droplevel(axis=0, level=1)
- Before the groupby, we reset the index so that we can group on the column index, which is the index from the original dataframe.

And that gives you your result.
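If it helps to see those intermediate pieces, inspecting the pipeline stage by stage shows where the index column the groupby relies on comes from (a small sketch):
tmp = df["Text"].str.extractall(pattern)
print(tmp.index.names)  # [None, 'match'] -- original row, then match number

# dropping 'match' and resetting turns the unnamed row index into a
# column literally called 'index', which is what groupby("index") uses
flat = tmp.droplevel(axis=0, level=1).reset_index()
print(flat.columns[0])  # 'index'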
Regex also will find the earliest matches. As an example, this means that regardless of the ordering of your targets, if your text contains greasy burgers V2, the regular expression will find greasy burgers, not burgers and not burgers V2.
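Checking that with the pattern built above (a quick sketch):
import re

re.findall(pattern, "greasy burgers V2")
# [('', '', '', '', 'greasy burgers', '', '', '', '')]
# the scan starts at position 0, where 'greasy burgers' matches first;
# 'burgers V2' begins later in the string and is never reached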