Reputation: 326
I have a df with a text column. Say:
import pandas as pd

d = {'text': ["merry had a little lamb and a broken limb", "Little Jonathan found a chicken"]}
df = pd.DataFrame(data=d)
I also have a list with ~400 keywords, for example:
observations_words_list = ["bent","block","broken"]
I wish to see how many records have at least one keyword in their text, and I'm currently doing it like this:
df = df['text'].dropna().reset_index()
df_len = len(df['text'])
obs = set(observations_words_list)
# Number of distinct keywords appearing in each record's text
df['Observations'] = df['text'].apply(lambda x: len(set(str(x).lower().split()).intersection(obs)))
# Fraction of records containing at least one keyword
obs_count = len(df[df['Observations'] > 0]) / df_len
For the sample df (in reality I read a CSV with ~0.5M records), I expect the new column to hold 1 for the first record and 0 for the second, and overall obs_count = 0.5.
Runtime is far from ideal, and I'm looking for a faster way to process this step.
Would love your ideas. Thanks!
Upvotes: 1
Views: 94
Reputation: 249
~290 ms for 100,000 copies of the sample df
I understand you just want to see how many records contain one or more keywords, and the absolute count isn't relevant. In that case there is an even better (and in my opinion more elegant) one-line solution using regex, which doesn't iterate over the data with apply:
import re
import pandas as pd

d = {'text': ["merry had a little lamb and a broken limb", "Little Jonathan found a chicken"] * 100000}
df = pd.DataFrame(data=d)
observations_words_list = ["bent", "block", "broken"]

# One alternation pattern with word boundaries: \bbent\b|\bblock\b|\bbroken\b
obs = '|'.join(r"\b{}\b".format(x) for x in observations_words_list)
contains_obs = df['text'].str.contains(obs, flags=re.IGNORECASE, regex=True)
obs_count = contains_obs.sum() / len(df['text'])  # equivalently: contains_obs.mean()
If you wish to keep the number of observations per record, you can use str.count for a small time penalty (~410 ms):
df['Observations'] = df['text'].str.count(obs, flags=re.IGNORECASE)
obs_count = (df['Observations'] > 0).sum() / len(df['text'])
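One caution: str.contains interprets the pattern as a regex, so if any of the ~400 real keywords contains a metacharacter (a dot, plus, parenthesis, ...), it would change the match. Escaping each keyword keeps it literal — a minimal sketch, using a hypothetical list with such a token:

import re

observations_words_list = ["bent", "block", "3.5mm"]  # hypothetical list containing a metacharacter
# re.escape makes each keyword match literally inside the alternation
obs = '|'.join(r"\b{}\b".format(re.escape(x)) for x in observations_words_list)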
Upvotes: 1
Reputation: 863116
There is a possible improvement with .gt(0) and mean — the mean of a boolean Series is the fraction of True rows, so it yields obs_count directly without building the helper column and the filtered DataFrame:
df['text'].apply(lambda x: len(set(str(x).lower().split()).intersection(obs))).gt(0).mean()
Timings, with the sample duplicated to 200,000 rows:

d = {'text': ["merry had a little lamb and a broken limb", "Little Jonathan found a chicken"]}
df = pd.DataFrame(data=d)
df = pd.concat([df] * 100000, ignore_index=True)
In [44]: %%timeit
...: df_len = len(df['text'])
...: obs = set(observations_words_list)
...: df['Observations'] = df['text'].apply(lambda x: len(set(str(x).lower().split()).intersection(obs)))
...: obs_count = len(df[df['Observations'] > 0])/df_len
...:
...:
448 ms ± 132 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [45]: %%timeit
...: obs = set(observations_words_list)
...: df['text'].apply(lambda x: len(set(str(x).lower().split()).intersection(obs))).mean()
...:
...:
324 ms ± 4.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
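Putting the improved version together as a runnable check on the original two-row sample (a sketch, assuming the same df and keyword list as in the question):

import pandas as pd

d = {'text': ["merry had a little lamb and a broken limb", "Little Jonathan found a chicken"]}
df = pd.DataFrame(data=d)
observations_words_list = ["bent", "block", "broken"]

obs = set(observations_words_list)
# The boolean mean is the fraction of rows with at least one keyword
obs_count = df['text'].apply(lambda x: len(set(str(x).lower().split()).intersection(obs))).gt(0).mean()
print(obs_count)  # 0.5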
Upvotes: 1
Reputation: 71600
You could try the following one-liner (with .any(axis=1), so a row containing several keywords is still counted once):
print(df['text'].str.lower().str.split(expand=True).isin(set(observations_words_list)).any(axis=1).mean())
Output:
0.5
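Note that split(expand=True) materialises a wide object-dtype DataFrame (one column per word position), which can get heavy at ~0.5M rows. A leaner per-row variant with the same result — a sketch assuming no missing text values, as in the sample:

import pandas as pd

d = {'text': ["merry had a little lamb and a broken limb", "Little Jonathan found a chicken"]}
df = pd.DataFrame(data=d)
obs = {"bent", "block", "broken"}

# not isdisjoint == the row shares at least one word with the keyword set
frac = df['text'].str.lower().str.split().map(lambda words: not obs.isdisjoint(words)).mean()
print(frac)  # 0.5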
Upvotes: 1