Reputation: 632
I am trying to remove all the text between square brackets from a transcript using the NLTK RegexpTokenizer:
import nltk
with open('speakers.txt', 'r') as file:
    read_file = file.read()
tokenizer = nltk.RegexpTokenizer(r'\[\[(?:[^\]|]*\|)?([^\]|]*)\]\]')
new_words = tokenizer.tokenize(read_file)
print(new_words)
[]
However, this code outputs only an empty list, []. What do I need to change so that the square brackets and their contents are removed?
Upvotes: 1
Views: 478
Reputation: 626929
You need to use the (?:\[[^][]*]|\s)+ regex and add the gaps=True argument, so that the tokenizer splits on whitespace and on any bracketed substring that has no nested brackets inside it:
tokenizer = nltk.RegexpTokenizer(r'(?:\[[^][]*]|\s)+', gaps=True)
See the regex demo.
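To check the pattern without installing NLTK, here is a minimal sketch using only the standard `re` module. It assumes the transcript marks speakers and annotations with single square brackets; the `sample` string and the helper name `tokenize_outside_brackets` are made up for illustration and are not from the question. With `gaps=True` the pattern describes the separators, so splitting on it and dropping empty strings should give the same tokens. Without `gaps=True` the pattern describes the tokens themselves, which is why the question's `[[...]]` pattern finds nothing in single-bracket text.

```python
import re

# Pattern from the answer: a run of bracketed chunks and/or whitespace.
GAP = re.compile(r'(?:\[[^][]*]|\s)+')

# Pattern from the question: it only matches double brackets like [[a|b]].
ORIGINAL = re.compile(r'\[\[(?:[^\]|]*\|)?([^\]|]*)\]\]')

def tokenize_outside_brackets(text):
    """Split on the gap pattern and drop empty strings (hypothetical helper)."""
    return [tok for tok in GAP.split(text) if tok]

# Hypothetical sample line, not taken from speakers.txt.
sample = "[Speaker 1] Hello there [laughs] how are you"

tokens = tokenize_outside_brackets(sample)
print(tokens)                    # tokens found outside the brackets
print(" ".join(tokens))          # transcript text with bracketed parts removed
print(ORIGINAL.findall(sample))  # no [[...]] in the sample, so nothing matches
```

Joining the tokens with a single space is one way to get the cleaned text back as a string, at the cost of collapsing the original whitespace.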
Pattern details
(?: - start of a non-capturing group:
    \[[^][]*] - a [, then zero or more chars other than [ and ], and then ]
    | - or
    \s - a whitespace
)+ - end of the group, with one or more repetitions of the pattern sequences in the group.
Upvotes: 1