Use NLTK RegexpTokenizer to remove text between square brackets

Question

I am trying to remove all the text between square brackets from a transcript using the NLTK RegexpTokenizer:

import nltk

file = open('speakers.txt', 'r')
read_file = file.read()

tokenizer = nltk.RegexpTokenizer(r'\[\[(?:[^\]|]*\|)?([^\]|]*)\]\]')
new_words = tokenizer.tokenize(read_file)
print(new_words)
[]

However, this code outputs only an empty list, []. What do I need to change so that the square brackets and their contents are removed from the text?

Wiktor Stribiżew · Accepted Answer

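Use the regex (?:\[[^][]*]|\s)+ and pass gaps=True, so that the pattern describes the separators rather than the tokens. The tokenizer then splits the text on any run of whitespace and bracketed substrings (square brackets with no nested brackets inside), which drops the bracketed text from the result: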

tokenizer = nltk.RegexpTokenizer(r'(?:\[[^][]*]|\s)+', gaps=True)

See the regex demo.
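If the goal is a cleaned string rather than a list of tokens, a plain substitution may be enough. The following is a minimal sketch using only the standard-library re module, not NLTK; the sample transcript is a hypothetical stand-in for the contents of speakers.txt:

```python
import re

# Hypothetical sample standing in for the contents of speakers.txt.
transcript = "[Speaker 1] Hello there. [laughs] How are you?"

# Bracket part of the answer's pattern: "[", any run of non-bracket chars, "]".
BRACKETED = re.compile(r'\[[^][]*]')

# Replace each bracketed span with a space, then collapse leftover whitespace.
cleaned = " ".join(BRACKETED.sub(" ", transcript).split())
print(cleaned)
```

Like the tokenizer pattern, this only handles brackets without nested brackets inside them.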

Pattern details

(?: - start of a non-capturing group:
- \[[^][]*] - a [, then zero or more chars other than [ and ], and then ]
| - or
- \s - a whitespace
)+ - one or more repetitions of the pattern sequences in the group.
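To see what the pattern treats as a gap, here is a small sketch that approximates the gaps=True behaviour with re.split from the standard library. It is not the NLTK implementation itself; tokenize_gaps and the sample string are hypothetical names introduced only for illustration:

```python
import re

# The answer's pattern: one or more bracketed chunks or whitespace chars.
GAP = re.compile(r'(?:\[[^][]*]|\s)+')

def tokenize_gaps(text):
    """Split text on GAP matches and drop the empty strings that
    appear when the text starts or ends with a gap."""
    return [tok for tok in GAP.split(text) if tok]

# Hypothetical sample line from a transcript.
sample = "[Speaker 1] Hello there. [laughs] How are you?"
tokens = tokenize_gaps(sample)
print(tokens)
```

Because the group is repeated with +, a sequence such as " [laughs] " counts as a single gap, so no empty tokens appear between adjacent separators.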
