niukasu

Reputation: 287

How to split text properly in Python for GloVe?

In GloVe, punctuation like '.' is counted as a word. But in abbreviations such as u.s. and u.k., the periods must not be split off.

For example, take this sentence:

he's going to u.s..

What GloVe wants is ["he", "'s", "going", "to", "u.s.", "."]. Is there a good way to split the text like that?

Upvotes: 0

Views: 374

Answers (1)

polm23

Reputation: 15593

You should split your input the same way the training input was split. Otherwise, tokens that don't match the vocabulary of the vectors will have no embedding to look up. If you are using pre-trained vectors and don't know how they were generated, you can train your own vectors or ask the creator how they tokenized their input.

Also, as a note, sentences don't end with a double period even if the last word is an abbreviation.

wrong: He's going to the U.S..
right: He's going to the U.S.

You can read a more detailed explanation of that here.

Also note that in modern English it's very common to not use periods in abbreviations - as an example, The Guardian has sections for "US News" and "UK News", without periods. As a practical matter, I think you don't need to worry about this particular issue unless it comes up a lot in your specific dataset.
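If you do end up tokenizing the input yourself, for example because you train your own vectors, a small regex tokenizer may be enough. The following is only a minimal sketch: the pattern `TOKEN_RE` and the function `tokenize` are hypothetical names, and the rules (lowercase everything, keep dotted abbreviations whole, split off clitics like 's) are assumptions based on the output requested in the question, not the preprocessing actually used for any pre-trained GloVe vectors.

```python
import re

# Hypothetical pattern: an assumption for illustration, not GloVe's own tokenizer.
# Alternatives are tried left to right, so abbreviations win over plain words.
TOKEN_RE = re.compile(
    r"""
    (?:[a-z]\.){2,}   # dotted abbreviations such as u.s. or u.k.
    | '[a-z]+         # clitics such as 's
    | [a-z0-9]+       # ordinary words and numbers
    | [^\sa-z0-9]     # any other single non-space character, e.g. '.'
    """,
    re.VERBOSE,
)


def tokenize(text):
    """Lowercase the text and return the list of matched tokens."""
    return TOKEN_RE.findall(text.lower())


print(tokenize("he's going to u.s.."))
# ['he', "'s", 'going', 'to', 'u.s.', '.']
```

With the correctly punctuated sentence "He's going to the U.S." the final period stays attached to the abbreviation, so no separate '.' token is produced. Whether that matters depends on how the vectors you use were tokenized.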

Upvotes: 1
