Reputation: 326
I have a df with a text column. Say:
import pandas as pd

d = {'text': ["merry had a little lamb and a broken limb", "Little Jonathan found a chicken"]}
df = pd.DataFrame(data=d)
I also have a list with ~400 keywords, for example:
observations_words_list = ["bent","block","broken"]
I wish to see how many records have at least one keyword in their text, and I'm currently doing it like this:
df = df['text'].dropna().reset_index()
df_len = len(df['text'])
obs = set(observations_words_list)
# Number of distinct keywords appearing in each record's text
df['Observations'] = df['text'].apply(lambda x: len(set(str(x).lower().split()).intersection(obs)))
# Fraction of records containing at least one keyword
obs_count = len(df[df['Observations'] > 0]) / df_len
For the sample df (in reality I read a CSV with ~0.5M records), I expect the new column to hold 1 for the first record and 0 for the second, and overall obs_count = 0.5.
Runtime is far from ideal, and I'm looking for a faster way to process this step.
Would love your ideas. Thanks!
Upvotes: 1
Views: 94
Reputation: 249
~290 ms for 100,000 copies of the sample df
I understand you just want to see how many records contain one or more keywords, and the absolute count isn't relevant. In that case there is an even better (and in my opinion more elegant) one-line solution using regex, which doesn't iterate over the data with apply:
import re
import pandas as pd

d = {'text': ["merry had a little lamb and a broken limb", "Little Jonathan found a chicken"] * 100000}
df = pd.DataFrame(data=d)
observations_words_list = ["bent", "block", "broken"]

# One alternation pattern with word boundaries: \bbent\b|\bblock\b|\bbroken\b
obs = '|'.join(r"\b{}\b".format(x) for x in observations_words_list)
contains_obs = df['text'].str.contains(obs, flags=re.IGNORECASE, regex=True)
obs_count = contains_obs.sum() / len(df['text'])  # equivalently: contains_obs.mean()
If you wish to keep the number of observations per record, you can use str.count for a small time penalty (~410 ms):
df['Observations'] = df['text'].str.count(obs, flags=re.IGNORECASE)
obs_count = (df['Observations'] > 0).sum() / len(df['text'])
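One caution: str.contains interprets the pattern as a regex, so if any of the ~400 real keywords contains a metacharacter (a dot, plus, parenthesis, ...), it would change the match. Escaping each keyword keeps it literal — a minimal sketch, using a hypothetical list with such a token:

import re

observations_words_list = ["bent", "block", "3.5mm"]  # hypothetical list containing a metacharacter
# re.escape makes each keyword match literally inside the alternation
obs = '|'.join(r"\b{}\b".format(re.escape(x)) for x in observations_words_list)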
Upvotes: 1
Reputation: 863116
There is a possible improvement with .gt(0) and mean — the mean of a boolean Series is the fraction of True rows, so it yields obs_count directly without building the helper column and the filtered DataFrame:
df['text'].apply(lambda x: len(set(str(x).lower().split()).intersection(obs))).gt(0).mean()
Timings, with the sample duplicated to 200,000 rows:

d = {'text': ["merry had a little lamb and a broken limb", "Little Jonathan found a chicken"]}
df = pd.DataFrame(data=d)
df = pd.concat([df] * 100000, ignore_index=True)
In [44]: %%timeit
...: df_len = len(df['text'])
...: obs = set(observations_words_list)
...: df['Observations'] = df['text'].apply(lambda x: len(set(str(x).lower().split()).intersection(obs)))
...: obs_count = len(df[df['Observations'] > 0])/df_len
...:
...:
448 ms ± 132 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [45]: %%timeit
...: obs = set(observations_words_list)
...: df['text'].apply(lambda x: len(set(str(x).lower().split()).intersection(obs))).mean()
...:
...:
324 ms ± 4.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
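Putting the improved version together as a runnable check on the original two-row sample (a sketch, assuming the same df and keyword list as in the question):

import pandas as pd

d = {'text': ["merry had a little lamb and a broken limb", "Little Jonathan found a chicken"]}
df = pd.DataFrame(data=d)
observations_words_list = ["bent", "block", "broken"]

obs = set(observations_words_list)
# The boolean mean is the fraction of rows with at least one keyword
obs_count = df['text'].apply(lambda x: len(set(str(x).lower().split()).intersection(obs))).gt(0).mean()
print(obs_count)  # 0.5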
Upvotes: 1
Reputation: 71600
You could try the following one-liner (with .any(axis=1), so a row containing several keywords is still counted once):
print(df['text'].str.lower().str.split(expand=True).isin(set(observations_words_list)).any(axis=1).mean())
Output:
0.5
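Note that split(expand=True) materialises a wide object-dtype DataFrame (one column per word position), which can get heavy at ~0.5M rows. A leaner per-row variant with the same result — a sketch assuming no missing text values, as in the sample:

import pandas as pd

d = {'text': ["merry had a little lamb and a broken limb", "Little Jonathan found a chicken"]}
df = pd.DataFrame(data=d)
obs = {"bent", "block", "broken"}

# not isdisjoint == the row shares at least one word with the keyword set
frac = df['text'].str.lower().str.split().map(lambda words: not obs.isdisjoint(words)).mean()
print(frac)  # 0.5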
Upvotes: 1