Moon Spoon

Reputation: 5

Tokenizing \n and \t characters within a string

I'm trying to tokenize sentences in Python using NLTK, but I also want the \n and \t characters to come out as tokens of their own.

Example:

In: "This is a\n test"

Out: ['This', 'is', 'a', '\n', 'test']

Is there a directly supported way to do this?

Upvotes: 0

Views: 554

Answers (1)

Dani Mesejo

Reputation: 61910

You could use a regex:

import re

text = "This is a\n test with\talso"
pattern = re.compile('[^\t\n]+|[\t\n]+')

output = [val for values in map(pattern.findall, text.split(' ')) for val in values]
print(output)

Output

['This', 'is', 'a', '\n', 'test', 'with', '\t', 'also']

The idea is to first split on a single space, then apply findall to each element of the resulting list. The pattern [^\t\n]+|[\t\n]+ matches either a run of one or more characters that are not a tab or newline, or a run of one or more tabs and newlines. If you want each tab and newline to be its own token, change the pattern to:

import re

text = "This is a\n test\n\nwith\t\talso"
pattern = re.compile('[^\t\n]+|[\t\n]')
output = [val for values in map(pattern.findall, text.split(' ')) for val in values]
print(output)

Output

['This', 'is', 'a', '\n', 'test', '\n', '\n', 'with', '\t', '\t', 'also']
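
If you would rather skip the split step, the same idea can probably be done in one findall pass by also excluding the space from the "word" character class. This is only a sketch: the function name and the merge_runs flag are made up for illustration and are not part of re or nltk, and it assumes the only separators you care about are space, tab and newline.

```python
import re

# Hypothetical helper for illustration; not part of nltk or the re module.
def tokenize_keep_tabs_newlines(text, merge_runs=False):
    # Words are runs of characters that are not a space, tab, or newline.
    # Tabs and newlines become tokens: one per character by default,
    # or one per consecutive run when merge_runs is True.
    # Spaces match neither alternative, so findall skips over them.
    whitespace_token = r'[\t\n]+' if merge_runs else r'[\t\n]'
    return re.findall(r'[^ \t\n]+|' + whitespace_token, text)

text = "This is a\n test\n\nwith\t\talso"
print(tokenize_keep_tabs_newlines(text))
print(tokenize_keep_tabs_newlines(text, merge_runs=True))
```

For this input the first call gives the same list as the second pattern above, and the second call groups the repeated characters into '\n\n' and '\t\t'. Other whitespace such as \r is still treated as part of a word, just as in the split-based version.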

Upvotes: 1
