Bharath M Shetty

Reputation: 30605

How to speed up counting the presence of keys across a series of documents? - Pandas, nltk

I have a dataframe column with documents like

38909    Hotel is an old style Red Roof and has not bee...
38913    I will never ever stay at this Hotel again. I ...
38914    After being on a bus for -- hours and finally ...
38918    We were excited about our stay at the Blu Aqua...
38922    This hotel has a great location if you want to...
Name: Description, dtype: object

I have a bag of words like keys = ['Hotel','old','finally'], but the actual length of keys is 44312.

Currently I'm using

df.apply(lambda x: sum([i in x for i in keys]))

This gives the following output for the sample keys:

38909    2
38913    2
38914    3
38918    0
38922    1
Name: Description, dtype: int64

When I apply this to the actual data for just 100 rows, timeit gives

1 loop, best of 3: 5.98 s per loop

and I have 50,000 rows. Is there a faster way to do this in nltk or pandas?

EDIT: In case you are looking for the document array:

array([ 'Hotel is an old style Red Roof and has not been renovated up to the new standard, but the price was also not up to the level of the newer style Red Roofs. So, in overview it was an OK stay, and a safe',
   'I will never ever stay at this Hotel again. I stayed there a few weeks ago, and I had my doubts the second I checked in. The guy that checked me in, I think his name was Julio, and his name tag read F',
   "After being on a bus for -- hours and finally arriving at the Hotel Lawerence at - am, I bawled my eyes out when we got to the room. I realize it's suppose to be a boutique hotel but, there was nothin",
   "We were excited about our stay at the Blu Aqua. A new hotel in downtown Chicago. It's architecturally stunning and has generally good reviews on TripAdvisor. The look and feel of the place is great, t",
   'This hotel has a great location if you want to be right at Times Square and the theaters. It was an easy couple of blocks for us to go to theater, eat, shop, and see all of the activity day and night '], dtype=object)

Upvotes: 1

Views: 114

Answers (3)

Leon

Reputation: 32494

The following code is not exactly equivalent to your (slow) version, but it demonstrates the idea:

keyset = frozenset(keys)
df.apply(lambda x: len(keyset.intersection(x.split())))

Differences/limitations:

  1. In your version a key is counted even if it only occurs as a substring of a word in the document. For example, had your keys contained the word tyl, it would have been counted due to the occurrence of "style" in your first document.
  2. My solution doesn't account for punctuation in the documents. For example, the word again in the second document comes out of split() with the full stop attached to it. That can be fixed by preprocessing the documents (or postprocessing the result of split()) with a function that removes the punctuation (see the sketch after this list).
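For instance, a minimal sketch of that second fix, reusing keys and df from the question (the count_keys helper and the \w+ tokenization are illustrative choices, not part of the answer):

import re

keyset = frozenset(keys)

def count_keys(doc):
    # Tokenize on word characters so trailing punctuation ("again.")
    # does not block a match, then count distinct tokens in the key set.
    return len(keyset.intersection(re.findall(r'\w+', doc)))

df.apply(count_keys)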

Upvotes: 2

Divakar

Reputation: 221614

It seems you can just use np.char.count -

[np.count_nonzero(np.char.count(i, keys)) for i in arr]

Might be better to feed a boolean array for counting -

[np.count_nonzero(np.char.count(i, keys)!=0) for i in arr]
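Put together, a minimal runnable sketch of this idea, assuming arr is the document array shown in the question's edit and keys is the key list:

import numpy as np

# np.char.count(i, keys) broadcasts over keys, returning how many times
# each key occurs as a substring of document i; counting the nonzero
# entries gives the number of distinct keys present in that document.
counts = [np.count_nonzero(np.char.count(i, keys) != 0) for i in arr]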

Upvotes: 2

jezrael

Reputation: 863031

If you only need to check whether each value of the list is present:

from numpy.core.defchararray import find

v = df['Description'].values.astype(str)
a = (find(v[:, None], keys) >= 0).sum(axis=1)
print(a)
[2 1 1 0 0]

Or:

df = pd.concat([df['Description'].str.contains(x) for x in keys], axis=1).sum(axis=1)
print(df)
38909    2
38913    1
38914    1
38918    0
38922    0
dtype: int64
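One caveat: str.contains interprets its pattern as a regular expression by default, so any key containing regex metacharacters could mismatch or raise an error. Passing regex=False (a standard pandas option) forces plain substring matching:

df = pd.concat([df['Description'].str.contains(x, regex=False) for x in keys], axis=1).sum(axis=1)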

Upvotes: 1
