Reputation: 325
I used a document converter to extract the text from a PDF. The text appears in this form:
"Hello Programmers\\nToday we will learn how to create a program in python\\nThefirst task is very easy and the level will exponentially increase\\nso please bare in mind that this course is not for the weak hearted\\n"
I am using NLTK to tokenize the document into sentences at each occurrence of \\n.
I have used the regex below, but it doesn't work.
Please excuse me if the regex is wrong; I am new to regular expressions.
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'^[\n]')
tokens = tokenizer.tokenize(text)  # text holds the string extracted from the PDF
>>> tokens
[]
# I also tried:
# tokenizer = RegexpTokenizer('\\n')
>>> tokens
['\n']
>>>
Even using \\n did not work. How can I write a correct regex?
Upvotes: 0
Views: 467
Reputation: 1207
Hey, you need to use gaps:
>>> tokenizer = RegexpTokenizer(r'\\n', gaps=True)
>>> tokenizer.tokenize(s)
['Hello Programmers', 'Today we will learn how to create a program in python', 'Thefirst task is very easy and the level will exponentially increase', 'so please bare in mind that this course is not for the weak hearted']
A RegexpTokenizer splits a string into substrings using a regular expression. With gaps=True, the tokenizer uses its regexp to match the delimiters between tokens instead of the tokens themselves.
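For example, with and without gaps (a minimal sketch using a shortened copy of the string from the question):
>>> from nltk.tokenize import RegexpTokenizer
>>> s = "Hello Programmers\\nToday we will learn how to create a program in python\\n"
>>> RegexpTokenizer(r'\\n').tokenize(s)             # default: the regexp matches the tokens
['\\n', '\\n']
>>> RegexpTokenizer(r'\\n', gaps=True).tokenize(s)  # gaps=True: the regexp matches the separators
['Hello Programmers', 'Today we will learn how to create a program in python']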
Upvotes: 1
Reputation: 574
The most basic solution, which may be useful, is:
text = "Hello Programmers\\nToday we will learn how to create a program in python\\nThefirst task is very easy and the level will exponentially increase\\nso please bare in mind that this course is not for the weak hearted\\n"
each_line = text.split('\\n')  # split on the literal \n sequences
for line in each_line:
    print(line)
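Note that because the text ends with \\n, split() leaves a trailing empty string in each_line; if that matters, it can be dropped (a small sketch, assuming the same text variable):
each_line = [line for line in text.split('\\n') if line]  # keep only non-empty lines
for line in each_line:
    print(line)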
Upvotes: 1