Bowis

Reputation: 632

Use NLTK RegexpTokenizer to remove text between square brackets

I am trying to remove all the text between square brackets from a transcript using the NLTK RegexpTokenizer:

import nltk

# Read the whole transcript into one string
with open('speakers.txt', 'r') as file:
    read_file = file.read()

tokenizer = nltk.RegexpTokenizer(r'\[\[(?:[^\]|]*\|)?([^\]|]*)\]\]')
new_words = tokenizer.tokenize(read_file)
print(new_words)
[]

However, this code outputs only an empty list, []. What do I need to change so that the square brackets and their contents are removed from the text?

Upvotes: 1

Views: 478

Answers (1)

Wiktor Stribiżew

Reputation: 626929

Use the (?:\[[^][]*]|\s)+ regex and pass the gaps=True argument. The pattern then describes the separators rather than the tokens, so the text is split on whitespace and on any substring inside square brackets that has no inner, nested brackets:

tokenizer = nltk.RegexpTokenizer(r'(?:\[[^][]*]|\s)+', gaps=True)

See the regex demo.
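
To see what the split does without installing NLTK, here is a minimal sketch that approximates the gaps=True behaviour with the standard library: split on the pattern and drop empty strings. The helper name tokenize_with_gaps and the sample line are made up for illustration and are not part of NLTK or of the original transcript.

```python
import re

# Same separator pattern as in the answer:
# a bracketed chunk without nested brackets, or whitespace, repeated.
SEPARATOR = re.compile(r'(?:\[[^][]*]|\s)+')

def tokenize_with_gaps(text):
    """Split text on the separator pattern and drop empty strings."""
    return [tok for tok in SEPARATOR.split(text) if tok]

# Hypothetical transcript line, used only for this sketch.
sample = "[Speaker 1] Hello there [laughs] how are you"
tokens = tokenize_with_gaps(sample)
print(tokens)  # ['Hello', 'there', 'how', 'are', 'you']
```

The bracketed parts never appear in the result because they are consumed as separators, together with the whitespace around them.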

Pattern details

  • (?: - start of a non-capturing group:
    • \[[^][]*] - a [, then zero or more chars other than [ and ], and then ]
  • | - or
    • \s - a whitespace
  • )+ - end of the group, which is repeated one or more times.
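
If the goal is a cleaned string rather than a list of tokens, the bracket part of the pattern can also be used on its own with re.sub. This is only a sketch under that assumption; strip_bracketed and the sample text are hypothetical names, not something from the question.

```python
import re

# Only the bracket alternative from the pattern above:
# "[", then any chars other than "[" and "]", then "]".
BRACKETED = re.compile(r'\[[^][]*]')

def strip_bracketed(text):
    """Remove bracketed chunks, then collapse runs of whitespace."""
    without_brackets = BRACKETED.sub(' ', text)
    return ' '.join(without_brackets.split())

# Hypothetical transcript line, used only for this sketch.
sample = "[Speaker 1] Hello there [laughs] how are you"
cleaned = strip_bracketed(sample)
print(cleaned)  # Hello there how are you
```

Because the character class excludes both bracket characters, only innermost brackets are matched: in "a [b [c] d] e" just "[c]" is removed in a single pass.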

Upvotes: 1
