Mark Singh

Reputation: 13

How to stop tokens such as T&C from being split into 'T', '&', 'C'

I'm trying to clean some text data ready for NLP techniques. I need patterns such as T&C and S&P to be left as they are, but when I tokenize sentences they get split into 'T', '&', 'C' rather than kept together as 'T&C'.

I've tried looking for exceptions to the rule, but I cannot find a general way of doing this for an arbitrary sequence, e.g. FT&P, S&ST or S&T.

import pandas as pd

from nltk import word_tokenize, pos_tag
from nltk.corpus import stopwords


en_stop = set(stopwords.words('english'))
en_stop.update(['shall', 'should', 'please'])

def rem_stopwords(txt):
    words = [w for w in word_tokenize(txt) if w not in en_stop]
    return " ".join(words)

rem_stopwords('what is f&p doing in regards')
Out[163]: ['f', '&', 'p', 'regards']

I want the output to be ['f&p', 'regards']

Upvotes: 1

Views: 616

Answers (2)

TextGeek

Reputation: 1247

The tokenizers that come with NLP systems are sometimes pretty basic, and even advanced ones may handle some edge cases in ways you might not prefer for a particular project.

Bottom line: you have several options:

  • Find an off-the-shelf solution that does exactly what you want.

  • Find a setting or configuration that adjusts one to do what you want. NLTK has several tokenizer variants, such as casual, MWETokenizer, nist, and punkt, and some of them accept options such as your own regexes (see https://www.nltk.org/api/nltk.tokenize.html); there's a short sketch of this route right after this list.

  • Write code to change an existing solution (if it's open source you can change the code itself; many systems also have an API that lets you override certain parts without digging too far into the guts).

  • Write your own tokenizer from scratch (this is considerably harder than it looks).

  • Pre- or post-process the data to fix specific problems.
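
For the configuration route, here's a minimal sketch using NLTK's RegexpTokenizer and MWETokenizer. The regex pattern and the hard-coded expression list below are just illustrations of the idea, not anything the library requires, so adjust them to your data:

from nltk.tokenize import RegexpTokenizer, MWETokenizer
from nltk import word_tokenize

# RegexpTokenizer: keep word&word runs (f&p, T&C, S&ST, ...) as single tokens,
# otherwise fall back to plain words and single punctuation marks
tokenizer = RegexpTokenizer(r'\w+(?:&\w+)+|\w+|[^\w\s]')
tokenizer.tokenize('what is f&p doing in regards')
# ['what', 'is', 'f&p', 'doing', 'in', 'regards']

# MWETokenizer: if you can list the problem tokens up front, re-merge them
# after word_tokenize has split them apart
mwe = MWETokenizer([('f', '&', 'p'), ('T', '&', 'C')], separator='')
mwe.tokenize(word_tokenize('what is f&p doing in regards'))
# ['what', 'is', 'f&p', 'doing', 'in', 'regards']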

But the ampersand may not be the only case you'll run into. I suggest going through each punctuation mark in turn and spending a minute thinking about what you want to happen when it shows up. Then you'll have a clearer set of goals in mind when evaluating your options. For example:

"&" -- also shows up in URLs, and be careful of "<" if you're parsing HTML, and "&&" if you're parsing code.

"/" -- you probably don't want to tokenize URLs at every slash (and certainly don't want to try parsing the resulting tokens as if they were a sentence!). There's also 12/31/2019, 1/2, and many more cases.

"-" -- Hypens are highly ambiguous: -1, 12-4, the double hyphen for clause-level dash (and the decrement operator in some code), end-of-line hyphenation (which might or might not want to be closed up), long strings of hyphens as separator lines.

Quotes -- curly vs. straight, single-quote vs. apostrophe for contractions or possessives (or incorrectly for plurals), and so on.

Unicode introduces cases like different types of whitespace, quotes, and dashes. Many editors like to "auto-correct" to Unicode characters like those, and even fractions: 1/2 may end up as a single character (do you want the tokenizer to break that into 3 tokens?).

It's fairly easy (and imho, an extremely useful exercise) to write up a small set of test cases and try them out. Some of the existing tokenizers can be tried out online, for example:

  • Stanford CoreNLP: http://corenlp.run/

  • Python NLTK: https://text-processing.com/demo/tokenize/

  • spaCy: http://textanalysisonline.com/spacy-word-tokenize

  • MorphAdorner: http://morphadorner.northwestern.edu/morphadorner/wordtokenizer/example/

This is just a small sample -- there are many others, and some of these have a variety of options.
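
If you'd rather test locally, a quick sketch like the following (with test strings of your own choosing) lets you see what NLTK's default word_tokenize does with each case:

from nltk import word_tokenize

# a few illustrative edge cases; swap in strings from your own data
tests = [
    'what is f&p doing in regards',
    'Read the T&C at https://example.com/page?a=1&b=2',
    'Delivered 12/31/2019, about 1/2 of the orders',
    "don't split the possessive in John's",
    'a quick-and-dirty fix -- or is it?',
]

for t in tests:
    print(word_tokenize(t))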

If you want a really quick-and-dirty solution for just this one case, you could post-process the token list to re-combine the problem cases, or pre-process the text to turn r'\w&\w' into some magic string that the tokenizer won't break up, then turn it back afterward. Those are pretty much hacks, but in limited circumstances they might be ok.
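
Here's a rough sketch of the pre-processing variant; the placeholder name is arbitrary, so pick something that cannot occur in your data:

import re
from nltk import word_tokenize

# arbitrary placeholder that the tokenizer treats as an ordinary word
PLACEHOLDER = 'AMPERSANDMARKER'

def tokenize_keeping_ampersands(txt):
    # protect '&' between word characters, tokenize, then restore it
    protected = re.sub(r'(?<=\w)&(?=\w)', PLACEHOLDER, txt)
    return [tok.replace(PLACEHOLDER, '&') for tok in word_tokenize(protected)]

tokenize_keeping_ampersands('what is f&p doing in regards')
# ['what', 'is', 'f&p', 'doing', 'in', 'regards']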

Upvotes: 4

qaiser

Reputation: 2868

You can use the split function instead of word_tokenize if it works better for your data; going by the example text, split can do the job for you:

def rem_stopwords(txt, en_stop):
    words = [w for w in txt.split() if w not in en_stop]
    return " ".join(words)

# output
rem_stopwords('what is f&p doing in regards', en_stop)
'f&p regards'

Upvotes: 1
