Tokenizing \n and \t characters within a string

Question

I'm trying to tokenize sentences in Python using NLTK, except I want to tokenize the \n and \t characters as well.

Example:

In: "This is a\n test"

Out: ['This', 'is', 'a', '\n', 'test']

Is there a way that's directly supported to do this?

Dani Mesejo · Accepted Answer

You could use a regex:

import re

text = "This is a\n test with\talso"
pattern = re.compile('[^\t\n]+|[\t\n]+')

output = [val for values in map(pattern.findall, text.split(' ')) for val in values]
print(output)

Output

['This', 'is', 'a', '\n', 'test', 'with', '\t', 'also']
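
To see how the one-liner arrives at that list, its two stages can be run separately. This is only an illustrative sketch; the names `chunks` and `tokens` are introduced here and are not part of the original answer.

```python
import re

text = "This is a\n test with\talso"
pattern = re.compile(r'[^\t\n]+|[\t\n]+')

# Stage 1: splitting on a single space leaves \n and \t attached to words.
chunks = text.split(' ')
print(chunks)  # ['This', 'is', 'a\n', 'test', 'with\talso']

# Stage 2: findall breaks each chunk into runs of ordinary characters
# and runs of tab/newline characters, in order of appearance.
tokens = []
for chunk in chunks:
    tokens.extend(pattern.findall(chunk))
print(tokens)  # ['This', 'is', 'a', '\n', 'test', 'with', '\t', 'also']
```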

The idea is to first split on a single space, then apply findall to each element of the resulting list. The pattern [^\t\n]+|[\t\n]+ matches either a run of one or more characters that are neither tab nor newline, or a run of one or more tabs and newlines, so consecutive tabs or newlines come out as one token. If you want each tab and newline to be its own token, change the pattern to:

import re

text = "This is a\n test\n\nwith\t\talso"
pattern = re.compile('[^\t\n]+|[\t\n]')
output = [val for values in map(pattern.findall, text.split(' ')) for val in values]
print(output)

Output

['This', 'is', 'a', '\n', 'test', '\n', '\n', 'with', '\t', '\t', 'also']
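
The split step can also be folded into the pattern by excluding the space from the word class, so a single findall pass does the work. This is a sketch rather than part of the accepted answer; the function name `tokenize_keep_ws` and its `collapse_runs` parameter are hypothetical.

```python
import re

def tokenize_keep_ws(text, collapse_runs=False):
    """Tokenize on spaces while keeping tabs and newlines as tokens.

    collapse_runs=False: each tab/newline is its own token.
    collapse_runs=True:  consecutive tabs/newlines form one token.
    """
    ws = r'[\t\n]+' if collapse_runs else r'[\t\n]'
    # Words are runs of anything except space, tab, and newline;
    # spaces match neither alternative, so findall skips them.
    return re.findall(r'[^ \t\n]+|' + ws, text)

sample = "This is a\n test\n\nwith\t\talso"
single = tokenize_keep_ws(sample)
merged = tokenize_keep_ws(sample, collapse_runs=True)
print(single)
print(merged)
```

NLTK's RegexpTokenizer also accepts a regular expression, so a pattern like this one may work there too, though that is untested here.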
