Lodore66

Reputation: 1185

Is my code inefficient or is my task hard?

I hope someone can help here.

I've written some code to perform a look-up task, and it's taking far longer than it seems it should. For confidentiality reasons, I can't describe the exact task, but I can give one that's directly analogous.

Let's say that I have a database containing a list of words and a set of values corresponding to these words. Say, colours and how much these colours are liked.

Now, I have a text, and I want to compare all the words in this text to my database of colours, and where a match exists, I extract the 'liking' rating from the database and record it. When all words in the text have been matched (or not) with the database, the sum of the 'liking' ratings is my output.

An equivalent to the code I've written is below, and it works fine for the toy example I've given. However, my actual problem contains a database of 40k entries, and the text typically has about 500 words; most of the words in the text will be in the database. And when I run it, it takes hours to execute. I understand that matching 500 words against a database of 40k entries means on the order of 20m comparisons. Still, hours?

Can anyone suggest whether I'm doing a computationally intensive problem on limited hardware, or whether my code is just massively inefficient?

Thanks!

import pandas as pd
import nltk

### Create toy data to test the code on

colour = ['red', 'orange', 'yellow', 'green', 'blue', 'indigo', 'violet']
values = [1, 2, 3, 4, 5, 6, 7]

### Code proper begins below

data = pd.DataFrame({'colour': colour, 'value': values})  # the lookup "database"


### Sample words
words = ['red', 'orange', 'yellow', 'green', 'blue', 'indigo', 'violet', 'cyan', 'white', 'black', 'pink', 'grey', 'scarlet']

### Running total and lists for matched/unmatched words
rating = 0
rated = []
unrated = []

### Code

for i in words:
    for j in range(len(data['colour'])):
        if i == data['colour'][j]:
            rating = rating + data['value'][j]
            rated.append(i)
            break
        elif i not in unrated and i not in data.values:  # ensure each unrated word is entered only once
            unrated.append(i)

Upvotes: 0

Views: 100

Answers (3)

alexis

Reputation: 50190

You need to index your keywords so that each word in your text can be looked up in O(1) time. If the dataset can fit in memory (and with 40k words it should be no problem), it can be as simple as this:

sentiment_index = dict(zip(colour, values))
rating = 0
rated = set()

for i in words:
    i = i.lower()
    rating += sentiment_index.get(i, 0)
    rated.add(i)

You could also write rating += sentiment_index[i], which is equally fast. But then you'd need an existence check, which I avoid by using get() with a default. And of course I added a set for the rated words. If you really do need to delegate lookups to a database, add an index to your dataframe to speed up lookups.
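If the lookup really must stay in pandas, here is a minimal sketch of what I mean by indexing the dataframe, using the toy data from the question (`sentiment` is just my name for the indexed column):

```python
import pandas as pd

# Toy data from the question
colour = ['red', 'orange', 'yellow', 'green', 'blue', 'indigo', 'violet']
values = [1, 2, 3, 4, 5, 6, 7]
data = pd.DataFrame({'colour': colour, 'value': values})

# Index the frame by the lookup key once; label lookups then go
# through a hash-based index instead of scanning every row
sentiment = data.set_index('colour')['value']

words = ['red', 'cyan', 'blue']
rating = sum(sentiment.get(w, 0) for w in words)
print(rating)  # red (1) + blue (5), 'cyan' missing -> 6
```

The key point is that `set_index` is done once, outside the loop, so each of your 500 words costs one cheap lookup rather than one pass over 40k rows.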

Upvotes: 2

Lodore66

Reputation: 1185

OK, I got some advice from someone offline that made a big difference here; it might be useful for anyone who comes looking with a similar problem. It seems my biggest bottleneck was in the line:

elif i not in unrated and i not in data.values:

This is because data.values is an array, and every membership test scans it linearly. Creating a variable

words = set(data1['Word'].values)

that turns the array into a set (where membership tests are O(1)), and testing against that set inside the loop instead of the array, dropped execution time from over an hour to about 3 minutes. That's still slow, but it's far better than it was.
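As a sketch of the change (using the toy colour data from the question rather than my real `data1`):

```python
import pandas as pd

data = pd.DataFrame({'colour': ['red', 'orange', 'yellow', 'green', 'blue', 'indigo', 'violet'],
                     'value': [1, 2, 3, 4, 5, 6, 7]})

# Build the set once, outside the loop: membership tests are then O(1)
known = set(data['colour'].values)

words = ['red', 'cyan', 'blue', 'cyan']
unrated = []
for w in words:
    if w not in unrated and w not in known:  # was: w not in data.values (linear scan)
        unrated.append(w)
print(unrated)  # ['cyan']
```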

Upvotes: 0

FDavidov

Reputation: 3675

Let's start with some maths... If you have 40,000 entries in your DB and you need to check 500 words (with no special apparent rules), you would need to perform 20,000,000 comparisons. This is NOT a small task.

Secondly, assuming you are comparing strings (words), you are also doing some CASE manipulations (e.g. making sure that "RED", "Red", "REd", "ReD", ..., "red" are all treated as the same value).
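If that case handling happens inside every comparison, it multiplies the work. A sketch of normalising once, up front, instead (the names here are illustrative, not from the question):

```python
# Normalise both sides once, before any comparisons
db_words = ['Red', 'ORANGE', 'yellow']
text_words = ['RED', 'Red', 'REd', 'red', 'Orange']

index = {w.lower() for w in db_words}          # lowercase the DB side once
matches = [w for w in text_words if w.lower() in index]
print(matches)  # every 'red' variant and 'Orange' match
```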

So, all in all, you should expect this to take a few seconds at least, but not HOURS (unless you are using an 8086 processor :-)).

Last, your code appears to be incomplete and hence cannot be examined.

Upvotes: 0
