Reputation: 325
I used a document converter to extract the text from a PDF. The text appears in this form:
"Hello Programmers\\nToday we will learn how to create a program in python\\nThefirst task is very easy and the level will exponentially increase\\nso please bare in mind that this course is not for the weak hearted\\n"
I am using NLTK to tokenize the document into sentences at each occurrence of \\n.
I have used the regex below, but it doesn't work.
Please excuse me if the regex is wrong; I am new to regular expressions.
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'^[\n]')
tokens = tokenizer.tokenize(text)  # text holds the string extracted from the PDF
>>> tokens
[]
# I also tried:
# tokenizer = RegexpTokenizer('\\n')
>>> tokens
['\n']
>>>
Even using \\n did not work. How can I write a correct regex?
Upvotes: 0
Views: 467
Reputation: 1207
Hey, you need to use gaps:
>>> tokenizer = RegexpTokenizer(r'\\n', gaps=True)
>>> tokenizer.tokenize(s)
['Hello Programmers', 'Today we will learn how to create a program in python', 'Thefirst task is very easy and the level will exponentially increase', 'so please bare in mind that this course is not for the weak hearted']
A RegexpTokenizer splits a string into substrings using a regular expression. With gaps=True, the tokenizer uses its regexp to match the delimiters between tokens instead of the tokens themselves.
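For example, with and without gaps (a minimal sketch using a shortened copy of the string from the question):
>>> from nltk.tokenize import RegexpTokenizer
>>> s = "Hello Programmers\\nToday we will learn how to create a program in python\\n"
>>> RegexpTokenizer(r'\\n').tokenize(s)             # default: the regexp matches the tokens
['\\n', '\\n']
>>> RegexpTokenizer(r'\\n', gaps=True).tokenize(s)  # gaps=True: the regexp matches the separators
['Hello Programmers', 'Today we will learn how to create a program in python']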
Upvotes: 1
Reputation: 574
The most basic solution, which may be useful, is:
text = "Hello Programmers\\nToday we will learn how to create a program in python\\nThefirst task is very easy and the level will exponentially increase\\nso please bare in mind that this course is not for the weak hearted\\n"
each_line = text.split('\\n')  # split on the literal \n sequences
for line in each_line:
    print(line)
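Note that because the text ends with \\n, split() leaves a trailing empty string in each_line; if that matters, it can be dropped (a small sketch, assuming the same text variable):
each_line = [line for line in text.split('\\n') if line]  # keep only non-empty lines
for line in each_line:
    print(line)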
Upvotes: 1