Reputation: 287
In GloVe, punctuation like '.' is counted as a word, but abbreviations such as u.s. and u.k. should not be split apart.
For example, take this sentence:
he's going to u.s..
What GloVe wants is ['he', "'s", 'going', 'to', 'u.s.', '.']. Is there a good way to split it like that?
Upvotes: 0
Views: 374
Reputation: 15593
You should split the input the same way the input used in training was split. If you are using pre-trained vectors and don't know how they were generated, you can train your own vectors or ask the creator how they tokenized their input.
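If you do end up tokenizing the text yourself, one way to get the split shown in the question is a regular expression that tries dotted abbreviations before anything else. This is only a sketch: `TOKEN_RE` and `tokenize` are names I made up for illustration, and this is not the tokenizer that was used to build any particular set of pre-trained vectors, so check the resulting tokens against the vocabulary you are actually using.

```python
import re

# Alternatives are tried left to right, so dotted abbreviations win over
# the generic "single punctuation character" rule.
TOKEN_RE = re.compile(r"""
    (?:[A-Za-z]\.){2,}     # dotted abbreviation: u.s., u.k., a.m.
  | '[A-Za-z]+             # clitic starting with an apostrophe: 's, 're, 'll
  | [A-Za-z0-9]+           # plain word or number
  | [^\sA-Za-z0-9]         # any other single non-space character
""", re.VERBOSE)

def tokenize(text):
    """Split text into tokens, keeping abbreviations like 'u.s.' whole."""
    return TOKEN_RE.findall(text)

print(tokenize("he's going to u.s.."))
# ['he', "'s", 'going', 'to', 'u.s.', '.']
```

With this pattern, a sentence that ends in a single period after an abbreviation ("he's going to u.s.") yields 'u.s.' as the last token and no separate '.', because the pattern cannot tell whether that period belongs to the abbreviation or to the sentence.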
Also, as a note, sentences don't end with a double period even if the last word is an abbreviation.
wrong: He's going to the U.S..
right: He's going to the U.S.
You can read a more detailed explanation of that here.
Also note that in modern English it's very common to not use periods in abbreviations - as an example, The Guardian has sections for "US News" and "UK News", without periods. As a practical matter, I think you don't need to worry about this particular issue unless it comes up a lot in your specific dataset.
Upvotes: 1