Reputation: 5
I'm trying to tokenize sentences in Python using NLTK, but I also want the \n and \t characters to be returned as tokens.
Example:
In: "This is a\n test"
Out: ['This', 'is', 'a', '\n', 'test']
Is there a directly supported way to do this?
Upvotes: 0
Views: 554
Reputation: 61910
You could use a regex:
import re
text = "This is a\n test with\talso"
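# Match a run of non-tab/non-newline characters, or a run of tabs/newlines
pattern = re.compile('[^\t\n]+|[\t\n]+')
# Split on single spaces, then apply findall to each piece and flatten the results
output = [val for values in map(pattern.findall, text.split(' ')) for val in values]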
print(output)
Output
['This', 'is', 'a', '\n', 'test', 'with', '\t', 'also']
The idea is to first split on single spaces, then apply findall to each element of the resulting list. The pattern [^\t\n]+|[\t\n]+ matches either a run of one or more characters that are neither tabs nor newlines, or a run of one or more tabs and newlines. If you want each tab and each newline to be its own token, change the pattern to:
import re
text = "This is a\n test\n\nwith\t\talso"
# Same as before, but each tab or newline is matched on its own (no trailing +)
pattern = re.compile('[^\t\n]+|[\t\n]')
output = [val for values in map(pattern.findall, text.split(' ')) for val in values]
print(output)
Output
['This', 'is', 'a', '\n', 'test', '\n', '\n', 'with', '\t', '\t', 'also']
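As a possible variation, the split step can be folded into the pattern itself by also excluding the space from the word class, so a single findall call does all the work. This is only a sketch: the function name tokenize_keep_whitespace and its collapse parameter are made up for illustration and are not part of NLTK or of the re module.

```python
import re

# Hypothetical helper for illustration; not an NLTK function.
def tokenize_keep_whitespace(text, collapse=False):
    """Split text on spaces while keeping tabs and newlines as tokens."""
    # With collapse=True, adjacent tabs/newlines are merged into one token;
    # otherwise each tab or newline becomes its own token.
    whitespace = r"[\t\n]+" if collapse else r"[\t\n]"
    # Words are runs of characters that are not a space, tab, or newline.
    # Spaces match neither alternative, so findall simply skips them.
    return re.findall(r"[^ \t\n]+|" + whitespace, text)

print(tokenize_keep_whitespace("This is a\n test\n\nwith\t\talso"))
print(tokenize_keep_whitespace("This is a\n test with\talso", collapse=True))
```

For the two example strings above, this should give the same tokens as the split-then-findall version.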
Upvotes: 1