Arjun

Reputation: 325

I need a Python regex to tokenize text into sentences at each "\\n"

I used a document converter to extract the text from a PDF. The text comes out in this form:

"Hello Programmers\\nToday we will learn how to create a program in python\\nThefirst task is very easy and the level will exponentially increase\\nso please bare in mind that this course is not for the weak hearted\\n"

I am using NLTK to tokenize the document into sentences at each occurrence of \\n. I have tried the regexes below, but they don't work.

Please excuse me if the regex is wrong; I am new to regular expressions.

from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'^[\n]')

>>> tokens
[]

..

#tokenizer = RegexpTokenizer('\\n')

>>> tokens
['\n']
>>> 

Even using \\n did not work. How can I write a correct regex?

Upvotes: 0

Views: 467

Answers (2)

Chaker

Reputation: 1207

You need to pass gaps=True:

>>> tokenizer = RegexpTokenizer(r'\\n', gaps=True)
>>> tokenizer.tokenize(s)
['Hello Programmers', 'Today we will learn how to create a program in python', 'Thefirst task is very easy and the level will exponentially increase', 'so please bare in mind that this course is not for the weak hearted']

A RegexpTokenizer splits a string into substrings using a regular expression. By default the regexp matches the tokens themselves; with gaps=True, it matches the delimiters between tokens instead.
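
As a rough illustration of what matching delimiters means, here is a sketch that uses only the standard re module rather than NLTK. The shortened sample string and the variable names are made up for the example, and it only approximates the tokenizer's behavior with its default of discarding empty tokens; it is not NLTK's code.

```python
import re

# Shortened, hypothetical sample in the same shape as the question's text
text = "Hello Programmers\\nToday we will learn\\nThe first task\\n"

# r'\\n' is a regex for a literal backslash followed by 'n' (two characters),
# not for a real newline character.
delimiter = re.compile(r'\\n')

# Split on the delimiter and drop empty pieces,
# such as the one left after the final delimiter.
tokens = [piece for piece in delimiter.split(text) if piece]
print(tokens)  # ['Hello Programmers', 'Today we will learn', 'The first task']
```

This also suggests why a pattern for a real newline finds nothing useful here: the converted text contains the two characters backslash and n, not a line break.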

Upvotes: 1

Astrophe

Reputation: 574

The most basic solution, which may be all you need, is a plain string split:

text = "Hello Programmers\\nToday we will learn how to create a program in python\\nThefirst task is very easy and the level will exponentially increase\\nso please bare in mind that this course is not for the weak hearted\\n"

# Split on the literal two-character sequence backslash + n
each_line = text.split('\\n')

for line in each_line:
    print(line)
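
One thing to watch for: because the text ends with the delimiter, str.split leaves an empty string as the last element. A small sketch of filtering it out, again with a shortened, made-up sample string:

```python
# Shortened, hypothetical sample that ends with the delimiter
text = "Hello Programmers\\nToday we will learn\\n"

parts = text.split('\\n')
print(parts)  # ['Hello Programmers', 'Today we will learn', '']

# Keep only the non-empty pieces
lines = [part for part in parts if part]
print(lines)  # ['Hello Programmers', 'Today we will learn']
```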

Upvotes: 1
