fscore

Reputation: 2619

Word-by-word analysis and scoring from a file in Python

I am doing a word by word analysis of a sentence such as
"Hey there!! This is a excellent movie???"

I have many sentences like the one above. I also have a huge dataset file, shown below, in which I have to do a quick lookup to see whether a word exists. If it does, I do some analysis and store the results in a dictionary, e.g. the word's score from the file, the score of the last word of the sentence, the score of the first word, and so on.

sentence[i] => "Hey there!! This is a excellent movie???"
sentence[0] = "Hey", sentence[1] = "there!!", sentence[2] = "This", and so on.
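(For reference, a plain whitespace split reproduces this indexing; punctuation stays attached to each token:)

```python
# Whitespace split keeps punctuation attached to each token
sentence = "Hey there!! This is a excellent movie???".split()
print(sentence[0])   # Hey
print(sentence[1])   # there!!
print(sentence[-1])  # movie???
```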

Here is the code:

def unigrams_nrc(file):
    for line in file:
        (term, score, numPos, numNeg) = re.split("\t", line.strip())
        if re.match(sentence[i], term.lower()):
            # presence or absence of unigrams of a target term
            found = True
            wordanalysis["unigram"] = found
        else:
            found = False
        if found:
            wordanalysis["trail_unigram"] = found if re.match(sentence[len(sentence) - 1], term.lower()) else not found
            wordanalysis["lead_unigram"] = found if re.match(sentence[0], term.lower()) else not found
            wordanalysis["nonzero_sscore"] = float(score) if float(score) != 0 else 0
            wordanalysis["sscore>0"] = (float(score) > 0)
            wordanalysis["sscore"] = (float(score) != 0)

        if re.match(sentence[len(sentence) - 1], term.lower()):
            wordanalysis["sscore !=0 last token"] = (float(score) != 0)

Here is the file (more than 4000 words in this file):

#fabulous   7.526   2301    2
#excellent  7.247   2612    3
#superb 7.199   1660    2
#perfection 7.099   3004    4
#terrific   6.922   629 1
#magnificent    6.672   490 1
#sensational    6.529   849 2
#heavenly   6.484   2841    7
#ideal  6.461   3172    8
#partytime  6.111   559 2
#excellence 5.875   1325    6
@thisisangel    5.858   217 1
#wonderful  5.727   3428    18
elegant 5.665   537 3
#perfect    5.572   3749    23
#fine   5.423   2389    17
excellence  5.416   279 2
#realestate 5.214   114 1
bicycles    5.205   113 1

I wanted to know if there is a better way to do the above. By "better" I mean faster, less code, and more elegant. I am new to Python, so I know this is not the best code. I have around 4 files through which I have to go and check the score, hence I want to implement this function in the best possible way.

Upvotes: 3

Views: 1411

Answers (2)

Jeremy Gordon

Reputation: 551

Maybe load the word/scores file into memory once as a dict of dicts, then loop through each word of each sentence, checking the dict for each word.

Would something like this work:

def load_words(file):
    word_lookup = {}
    for line in file:
        (term, score, numPos, numNeg) = re.split("\t", line.strip())
        if term not in word_lookup:
            word_lookup[term] = {'score': score, 'numPos': numPos, 'numNeg': numNeg}
    return word_lookup

def run_sentence(s):
    s = standardize_sentence(s)  # Assuming you want to strip punctuation, convert to lowercase, etc.
    words = s.split(' ')
    first = words[0]
    last = words[-1]
    for word in words:
        word_info = check_word(word)
        if word_info:
            pass  # Matched word, use your scores somehow (word_info['score'], etc.)

def check_word(word):
    if word in word_lookup:
        return word_lookup[word]
    else:
        return None

word_lookup = load_words(file)
for s in sentences:
    run_sentence(s)
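As a rough sketch of how that lookup table could be built from lines in the question's tab-separated format (the two sample lines here are copied from the file above; plain `in` membership tests replace the Python 2-only `has_key`):

```python
import re

# Two sample lines in the question's tab-separated format
lines = [
    "#fabulous\t7.526\t2301\t2",
    "elegant\t5.665\t537\t3",
]

word_lookup = {}
for line in lines:
    term, score, numPos, numNeg = re.split("\t", line.strip())
    word_lookup[term] = {"score": float(score), "numPos": int(numPos), "numNeg": int(numNeg)}

print(word_lookup["elegant"]["score"])  # 5.665
print("movie" in word_lookup)           # False
```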

Upvotes: 1

James Mills

Reputation: 19030

Here are my tips:

  • Write your file out as JSON using json.dumps()
  • Load your file back in as JSON using json.loads()
  • Separate your data loading from your analysis into separate logical code blocks, e.g. functions

Python dicts are much faster for lookups, with O(1) complexity, than iterating over the file, which is O(n) -- so you'll get a performance benefit as long as you load your data file up front.

Example(s):

from json import dumps, loads


def load_data(filename):
    with open(filename, "r") as f:
        return loads(f.read())

def save_data(filename, data):
    with open(filename, "w") as f:
        f.write(dumps(data))

data = load_data("data.json")

foo = data["word"]  # O(1) lookup of "word"

I would probably store your data like this:

data = {
    "fabulous": [7.526, 2301, 2],
    ...
}

You would then do:

stats = data.get(word, None)
if stats is not None:
    score, x, y = stats
    ...

NB: The ... are NOT real code, just placeholders where you should fill in the blanks.
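Putting the tips together, a minimal end-to-end sketch (the in-memory `tsv` string here stands in for the real data file, and the list-per-word layout is the one suggested above):

```python
import json

# Stand-in for the real tab-separated data file
tsv = "#fabulous\t7.526\t2301\t2\nelegant\t5.665\t537\t3\n"

# One-off conversion: tab-separated lines -> dict keyed by term
data = {}
for line in tsv.splitlines():
    term, score, numPos, numNeg = line.split("\t")
    data[term] = [float(score), int(numPos), int(numNeg)]

blob = json.dumps(data)   # what save_data() would write to disk
data = json.loads(blob)   # what load_data() would read back

stats = data.get("elegant")  # O(1) lookup instead of rescanning the file
if stats is not None:
    score, numPos, numNeg = stats
    print(score)  # 5.665
```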

Upvotes: 3
