Reputation: 2619
I am doing a word-by-word analysis of sentences such as
"Hey there!! This is a excellent movie???"
and I have many sentences like the one above.
I have a huge dataset file, shown below, against which I have to do a quick lookup to check whether a given word exists. If it does, I do some analysis and store the results in a dictionary, such as the word's score from the file, the score of the last word of the sentence, of the first word, and so on.
Each sentence is split word by word, so for sentence = "Hey there!! This is a excellent movie???": sentence[0] = "Hey", sentence[1] = "there!!", sentence[2] = "This", and so on.
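(In code, that tokenization amounts to a plain whitespace split, with punctuation kept attached to each token:)

    sentence = "Hey there!! This is a excellent movie???".split()
    # sentence[0] == 'Hey', sentence[1] == 'there!!', sentence[-1] == 'movie???'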
Here is the code:
    import re

    # sentence[i] is the current token being analyzed; wordanalysis is the result dict
    def unigrams_nrc(file):
        for line in file:
            (term, score, numPos, numNeg) = re.split("\t", line.strip())
            if re.match(sentence[i], term.lower()):
                # presence or absence of unigrams of a target term
                found = True
                wordanalysis["unigram"] = found
            else:
                found = False
            if found:
                wordanalysis["trail_unigram"] = found if re.match(sentence[len(sentence) - 1], term.lower()) else not found
                wordanalysis["lead_unigram"] = found if re.match(sentence[0], term.lower()) else not found
                wordanalysis["nonzero_sscore"] = float(score) if float(score) != 0 else 0
                wordanalysis["sscore>0"] = (float(score) > 0)
                wordanalysis["sscore"] = (float(score) != 0)
            if re.match(sentence[len(sentence) - 1], term.lower()):
                wordanalysis["sscore !=0 last token"] = (float(score) != 0)
Here is the file (more than 4000 words in this file):
#fabulous 7.526 2301 2
#excellent 7.247 2612 3
#superb 7.199 1660 2
#perfection 7.099 3004 4
#terrific 6.922 629 1
#magnificent 6.672 490 1
#sensational 6.529 849 2
#heavenly 6.484 2841 7
#ideal 6.461 3172 8
#partytime 6.111 559 2
#excellence 5.875 1325 6
@thisisangel 5.858 217 1
#wonderful 5.727 3428 18
elegant 5.665 537 3
#perfect 5.572 3749 23
#fine 5.423 2389 17
excellence 5.416 279 2
#realestate 5.214 114 1
bicycles 5.205 113 1
I wanted to know if there is a better way to do the above. By "better" I mean faster, less code, and more elegant. I am new to Python, so I know this is not the best code. I have around four files to go through and check scores in, so I want to implement this function in the best possible way.
Upvotes: 3
Views: 1411
Reputation: 551
Maybe load the word/scores file into memory once as a dict of dicts, then loop through each word of each sentence, checking it against the dict's keys.
Would something like this work:
    import re

    def standardize_sentence(s):
        # Assuming you want to strip punctuation, symbols, convert to lowercase, etc.
        return re.sub(r"[^\w\s]", "", s).lower()

    def load_words(file):
        word_lookup = {}
        for line in file:
            (term, score, numPos, numNeg) = re.split("\t", line.strip())
            if term not in word_lookup:
                word_lookup[term] = {'score': score, 'numPos': numPos, 'numNeg': numNeg}
        return word_lookup

    def check_word(word):
        # dict.has_key() is Python 2 only; the `in` operator works everywhere
        if word in word_lookup:
            return word_lookup[word]
        return None

    def run_sentence(s):
        s = standardize_sentence(s)
        words = s.split(' ')
        first = words[0]
        last = words[-1]
        for word in words:
            word_info = check_word(word)
            if word_info:
                pass  # Matched word, use your scores somehow (word_info['score'], etc)

    word_lookup = load_words(file)  # `file` is your open scores file
    for s in sentences:             # `sentences` is your list of sentences
        run_sentence(s)
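For the sample file in the question, word_lookup would then look something like this (first two entries shown; note the values stay strings unless you convert them):

    word_lookup = {
        '#fabulous':  {'score': '7.526', 'numPos': '2301', 'numNeg': '2'},
        '#excellent': {'score': '7.247', 'numPos': '2612', 'numNeg': '3'},
        # ... one entry per line of the file
    }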
Upvotes: 1
Reputation: 19030
Here are my tips:
- json.dumps()
- json.loads()
- Python dict(s) are much faster for lookups, with a complexity of O(1), than iteration, which is O(n) -- so you'll get some performance benefit there as long as you load up your data file initially.
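To see that difference in practice, here is a quick micro-benchmark sketch (the sizes are arbitrary, chosen to roughly match the ~4000-word file):

    import timeit

    terms = ["word%d" % i for i in range(4000)]  # ~4000 terms, like the file
    as_dict = dict.fromkeys(terms)               # dict: O(1) average lookup
    as_list = list(terms)                        # list: O(n) scan

    print(timeit.timeit(lambda: "word3999" in as_dict, number=100000))  # fast
    print(timeit.timeit(lambda: "word3999" in as_list, number=100000))  # slow: walks the list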
Example(s):
    from json import dumps, loads

    def load_data(filename):
        with open(filename, "r") as f:
            return loads(f.read())

    def save_data(filename, data):
        with open(filename, "w") as f:
            f.write(dumps(data))

    data = load_data("data.json")
    foo = data["word"]  # O(1) lookup of "word"
I would probably store your data like this:
    data = {
        "fabulous": [7.526, 2301, 2],
        ...
    }
You would then do:
    stats = data.get(word, None)
    if stats is not None:
        score, x, y = stats
        ...
NB: The ... are NOT real code, just placeholders where you should fill in the blanks.
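Putting it together with the question's tab-separated file, a sketch of the one-time conversion into that layout (the filenames are hypothetical), reusing save_data() from above:

    def tsv_to_dict(tsv_filename):
        # One-time conversion of the tab-separated scores file into the dict layout above
        data = {}
        with open(tsv_filename, "r") as f:
            for line in f:
                term, score, numPos, numNeg = line.strip().split("\t")
                data[term] = [float(score), int(numPos), int(numNeg)]
        return data

    data = tsv_to_dict("unigrams.tsv")  # hypothetical input filename
    save_data("data.json", data)        # reuse save_data() defined above

    stats = data.get("#fabulous")
    if stats is not None:
        score, numPos, numNeg = stats   # 7.526, 2301, 2 for the sample file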
Upvotes: 3