VISHAL GADHVI

Reputation: 23

Tokenizing multi-word expressions from a list of words in Python?

My program has a list of words, and a few specific word sequences among them need to be tokenized as a single word.
Currently the program splits a string into individual words, e.g.

str="hello my name is vishal, can you please help me with the red blood cells and platelet count. The white blood cell is a single  word."

The output is:

list=['hello','my','name','is','vishal','can','you','please','help','me','with','the','red','blood','cells','and','platelet','count','the','white','blood','cell','is','a','single','word']

Now I want to tokenize phrases such as 'red blood cell' as a single word. My list contains many such phrases, made up of two, three or more words, that should each be treated as one token, such as 'platelet count', 'white blood cell', etc. Any suggestions for doing that?

Upvotes: 1

Views: 93

Answers (2)

Sebin Sunny

Reputation: 1803

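from nltk.tokenize import MWETokenizer

# Avoid naming the variable `str`, which would shadow the built-in type.
text = "hello my name is vishal, can you please help me with the red blood cells and platelet count. The white blood cell is a single  word."
tokenizer = MWETokenizer()
# Register each multi-word expression as a tuple of its tokens.
tokenizer.add_mwe(('red', 'blood', 'cells'))
tokenizer.add_mwe(('white', 'blood', 'cell'))
# split() keeps punctuation attached, so the period is part of the token here.
tokenizer.add_mwe(('platelet', 'count.'))
print(tokenizer.tokenize(text.split()))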

Output
`['hello', 'my', 'name', 'is', 'vishal,', 'can', 'you', 'please', 'help', 'me', 'with', 'the', 'red_blood_cells', 'and', 'platelet_count.', 'The', 'white_blood_cell', 'is', 'a', 'single', 'word.']`

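You can use NLTK's Multi-Word Expression Tokenizer, MWETokenizer. It merges each registered word sequence into a single token, joined by an underscore by default.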
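
The 'count.' entry above is only needed because splitting on whitespace leaves punctuation attached to the tokens. As a rough sketch of the same merging idea in plain Python (this is not how MWETokenizer is implemented), one could strip punctuation with a regex first and then merge phrases greedily, longest first. The names `PHRASES` and `merge_phrases` are hypothetical, and the lowercasing and the `[a-z0-9]+` token pattern are assumptions that may not suit every text.

```python
import re

# Hypothetical phrase list; in practice this would come from your own vocabulary.
PHRASES = [
    ("red", "blood", "cells"),
    ("white", "blood", "cell"),
    ("platelet", "count"),
]

def merge_phrases(text, phrases, separator="_"):
    """Lowercase, drop punctuation, then merge known phrases into single tokens."""
    # Assumption: tokens are runs of ASCII letters and digits.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    # Try longer phrases first so that the longest match wins.
    ordered = sorted(phrases, key=len, reverse=True)
    merged = []
    i = 0
    while i < len(tokens):
        for phrase in ordered:
            if tuple(tokens[i:i + len(phrase)]) == phrase:
                merged.append(separator.join(phrase))
                i += len(phrase)
                break
        else:
            # No phrase starts at this position; keep the single token.
            merged.append(tokens[i])
            i += 1
    return merged

text = ("hello my name is vishal, can you please help me with the red blood "
        "cells and platelet count. The white blood cell is a single  word.")
result = merge_phrases(text, PHRASES)
print(result)
```

With this approach the phrase list can hold 'platelet count' without the trailing period, at the cost of losing case and punctuation in the output.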

Upvotes: 2

Balraj Ashwath

Reputation: 1427

N-grams can be used to group together consecutive words that often occur together in a corpus of text. See the Wikipedia article on n-grams for background.

To implement this in scikit-learn, set the ngram_range parameter (for example, on CountVectorizer) to the range of n-gram sizes needed for the task (bi-grams, tri-grams, ...). In your case that is ngram_range=(1, 3), which covers single words, bi-grams and tri-grams.
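
To make the idea of a (1, 3) range concrete, here is a small plain-Python sketch that lists every run of one to three consecutive tokens. The helper name `ngrams` is hypothetical and this is not scikit-learn's implementation, only an illustration of which word sequences such a range covers.

```python
def ngrams(tokens, min_n=1, max_n=3):
    """Return every run of min_n..max_n consecutive tokens, joined by spaces."""
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of width n across the token list.
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

tokens = ["the", "white", "blood", "cell", "count"]
grams = ngrams(tokens)
print(grams)
```

Note that this produces every candidate sequence, including ones like 'the white' that are not meaningful phrases, so some selection step (for instance by frequency, or against a known phrase list) is still needed afterwards.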

Upvotes: 1
