mchangun

Reputation: 10322

Faster way to count number of string occurrences in a numpy array python

I have a numpy array of tuples:

trainY = np.array([('php', 'image-processing', 'file-upload', 'upload', 'mime-types'),
                   ('firefox',), ('r', 'matlab', 'machine-learning'),
                   ('c#', 'url', 'encoding'), ('php', 'api', 'file-get-contents'),
                   ('proxy', 'active-directory', 'jmeter'), ('core-plot',),
                   ('c#', 'asp.net', 'windows-phone-7'),
                   ('.net', 'javascript', 'code-generation'),
                   ('sql', 'variables', 'parameters', 'procedure', 'calls')], dtype=object)

I am given a list of indices that selects a subset of this array:

x = [0, 4]

and a string:

label = 'php'

I want to count the number of times the label 'php' occurs in this subset of the np.array. In this case, the answer would be 2.

Notes:

1) A label appears at most ONCE in a tuple.

2) The tuple can have length from 1 to 5.

3) Length of the list x is typically 7-50.

4) Length of trainY is approximately 800,000.

My current code to do this is:

sum([1 for n in x if label in trainY[n]])

This is currently a performance bottleneck in my program, and I'm looking for a way to make it much faster. I think we could skip the loop over x and do a vectorised lookup on trainY, something like trainY[x], but I couldn't get anything to work.

Thank you.

Upvotes: 5

Views: 6947

Answers (3)

Ffisegydd

Reputation: 53698

I think using Counters may be a good option in this case.

from collections import Counter

c = Counter([i for j in trainY for i in j])

print(c['php'])  # Prints 2
print(c.most_common(5))  # Prints the 5 most common items.
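Note that the Counter above is built over the whole of trainY, whereas the question asks about the subset picked out by x. A minimal sketch of restricting the count to that subset follows. It assumes a plain list of tuples as a stand-in for the object array (both index the same way with an integer), and the name subset_counts is made up for illustration. Whether this beats the original one-liner is not something I have measured; it mainly helps when several labels are looked up against the same subset.

```python
from collections import Counter

# Stand-in for the question's data: a plain list of tuples
# (assumption: the object array is indexed the same way, one tuple per row).
trainY = [('php', 'image-processing', 'file-upload', 'upload', 'mime-types'),
          ('firefox',), ('r', 'matlab', 'machine-learning'),
          ('c#', 'url', 'encoding'), ('php', 'api', 'file-get-contents'),
          ('proxy', 'active-directory', 'jmeter'), ('core-plot',),
          ('c#', 'asp.net', 'windows-phone-7'),
          ('.net', 'javascript', 'code-generation'),
          ('sql', 'variables', 'parameters', 'procedure', 'calls')]

x = [0, 4]
label = 'php'

# Count every tag, but only in the rows selected by x.
subset_counts = Counter(tag for n in x for tag in trainY[n])

# A Counter returns 0 for tags it has never seen, so no KeyError handling is needed.
php_count = subset_counts[label]
```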

Upvotes: 6

Saullo G. P. Castro

Reputation: 58915

You can use np.in1d after flattening your array with a list comprehension:

trainY = np.array([i for j in trainY for i in j])
ans = np.in1d(trainY, 'php').sum()
# 2

Upvotes: 2

lev

Reputation: 4127

Consider building a dictionary of the form:

{'string1': (1,2,5),
 'string2': (3,4,5),
 ...
}

For every string, hold a sorted list of the indices of the tuples it appears in. Hope that makes sense.
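A rough sketch of this idea, under a few assumptions: it uses sets of indices rather than sorted lists (a swap, so that each membership test is a hash lookup), a plain list of tuples stands in for trainY, and the helper names build_index and count_label are invented for illustration. The index is built once over all of trainY; each query then touches only the 7-50 entries of x.

```python
from collections import defaultdict

def build_index(rows):
    """Map each label to the set of row indices whose tuple contains it."""
    index = defaultdict(set)
    for i, row in enumerate(rows):
        for tag in row:
            index[tag].add(i)
    return index

def count_label(index, label, indices):
    """Count how many of the given row indices contain the label."""
    # .get avoids creating an empty entry for labels that were never seen.
    hits = index.get(label, ())
    return sum(1 for n in indices if n in hits)

# Stand-in for the question's data (assumption: a plain list of tuples).
trainY = [('php', 'image-processing', 'file-upload', 'upload', 'mime-types'),
          ('firefox',), ('r', 'matlab', 'machine-learning'),
          ('c#', 'url', 'encoding'), ('php', 'api', 'file-get-contents'),
          ('proxy', 'active-directory', 'jmeter'), ('core-plot',),
          ('c#', 'asp.net', 'windows-phone-7'),
          ('.net', 'javascript', 'code-generation'),
          ('sql', 'variables', 'parameters', 'procedure', 'calls')]

index = build_index(trainY)            # built once, reused for every query
result = count_label(index, 'php', [0, 4])
```

The one-off cost of building the index only pays for itself if the same trainY is queried many times, which seems to be the situation described in the question.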

Upvotes: 0
