Reputation: 33
I'm using the Stanford Parser from the command line:
java -mx1500m -cp stanford-parser.jar;stanford-parser-models.jar edu.stanford.nlp.parser.lexparser.LexicalizedParser -outputFormat "penn" edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz {file}
When I run the command on a single sentence with 27 words, the Java process consumes 100 MB of memory and parsing takes 1.5 seconds. When I run it on a single sentence with 148 words, the Java process consumes 1.5 GB of memory and parsing takes 1.5 minutes.
The machine I'm using runs Windows 7 on an Intel i5 at 2.53 GHz.
Are these processing times reasonable? Is there any official performance benchmark for the parser?
Upvotes: 1
Views: 1556
Reputation: 121992
As commented, your problem lies in sentence segmentation, since your data allows any input (with or without proper punctuation). Luckily, the text is still capitalized, so you can try the recipe below to segment sentences by capitalization.
Disclaimer: If your sentence starts with "I", then the recipe below isn't going to help much =)
"Something gotta change It must be rearranged I'm sorry, I did not mean to hurt my little girl It's beyond me I cannot carry the weight of the heavy world So good night, good night, good night, good night Good night, good night, good night, good night, good night Hope that things work out all right So much to love, so much to learn But I won't be there to teach you Oh, I know I can be close But I try my best to reach you I'm so sorry I didn't not mean to hurt my little girl It's beyond me I cannot carry the weight of the heavy world So good night, good night, good night, good night Good night, good night, good night, good night Good night, good night, good night good night, good night Hope that things work out all right, yeah Thank you."
In Python, you can try this to segment the sentence:
sentence = "Something gotta change It must be rearranged I'm sorry, I did not mean to hurt my little girl It's beyond me I cannot carry the weight of the heavy world So good night, good night, good night, good night Good night, good night, good night, good night, good night Hope that things work out all right So much to love, so much to learn But I won't be there to teach you Oh, I know I can be close But I try my best to reach you I'm so sorry I didn't not mean to hurt my little girl It's beyond me I cannot carry the weight of the heavy world So good night, good night, good night, good night Good night, good night, good night, good night Good night, good night, good night good night, good night Hope that things work out all right, yeah Thank you."
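sentences = []
temp = []
for word in sentence.split():
    # A capitalized word (other than the pronoun "I") starts a new sentence.
    if word[0].isupper() and word != "I":
        if temp:
            sentences.append(" ".join(temp))
        temp = [word]
    else:
        temp.append(word)
# Flush the last sentence.
if temp:
    sentences.append(" ".join(temp))
print(sentences)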
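The recipe only skips the bare pronoun "I", so contractions such as "I'm" still trigger a split. A minimal sketch of a variant that also skips a few common contractions is shown here; the function name `segment_by_capitalization` and the `PRONOUN_I` pattern are my own illustrative choices, and the contraction list is an assumption, not an exhaustive one.

```python
import re

# Assumption: "I", "I'm", "I'll", "I've" and "I'd" (optionally followed by
# punctuation) are the pronoun, not a sentence start. Illustrative only.
PRONOUN_I = re.compile(r"^I('(m|ll|ve|d))?[,.!?]*$")

def segment_by_capitalization(text):
    """Split text into chunks that each begin at a capitalized word."""
    sentences, current = [], []
    for word in text.split():
        starts_sentence = word[0].isupper() and not PRONOUN_I.match(word)
        # Close the running chunk before starting a new one.
        if starts_sentence and current:
            sentences.append(" ".join(current))
            current = []
        current.append(word)
    if current:
        sentences.append(" ".join(current))
    return sentences

print(segment_by_capitalization("It's beyond me I cannot carry the weight So good night"))
```

The trade-off is the mirror image of the disclaimer: a sentence that really does start with "I'm" is now glued to the previous one, and proper nouns in the middle of a sentence still cause a split either way.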
Then follow the "Stanford Parser and NLTK" question to parse each sentence.
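If you would rather keep using the command line from the question, one option is to write the segmented chunks to a file, one per line, and pass that file to the parser. The sketch below only does the file writing; the helper name and the output file name are hypothetical. As far as I recall, the parser has a `-sentences newline` option that treats each line as one sentence, but check that against the documentation of your parser version.

```python
import os
import tempfile

def write_one_sentence_per_line(sentences, path):
    """Write each non-empty sentence on its own line (hypothetical helper)."""
    with open(path, "w", encoding="utf-8") as handle:
        for sentence in sentences:
            sentence = sentence.strip()
            if sentence:
                handle.write(sentence + "\n")

# Example chunks, as produced by segmenting on capitalization.
chunks = ["Something gotta change", "It must be rearranged"]
out_path = os.path.join(tempfile.gettempdir(), "segmented.txt")
write_one_sentence_per_line(chunks, out_path)
```

Shorter inputs per parse should keep both memory use and run time much closer to the 27-word case than to the 148-word one.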
Upvotes: 2