Raven

Reputation: 43

Removing multiple \n in Python before sentence tokenizing

I'm brand new to programming and am teaching myself from a book and Stack Overflow. I'm trying to remove multiple instances of \n in a chat corpus and then tokenize the sentences. If I don't remove the \n, the strings look like this:

['answers for 10-19-20sUser139 ... hi 10-19-20sUser101 ;)\n\n\n\n\n\n\n\n\n\nI like it when you do it, 10-19-20sUser83\n\n\n\n\n\n\n\n\n\n\n\niamahotnipwithpics\n\n\n\n10-19-20sUser20 go plan the wedding!']

I've tried several different methods, like chomp, line, rstrip, etc., and none of them seem to work. It could be that I am using them wrong. The whole code looks like this:

import nltk, re, pprint
from nltk.corpus import nps_chat
chat= nltk.Text(nps_chat.words())
from nltk.corpus import NPSChatCorpusReader
from bs4 import BeautifulSoup
chat=nltk.corpus.nps_chat.raw()
soup= BeautifulSoup(chat)
soup.get_text()
text =soup.get_text()
print(text[:40])
print(len(text))
from nltk.tokenize import sent_tokenize
sent_chat = sent_tokenize(text)
len(sent_chat)
text[:] = [line.rstrip('\n') for line in text]
print(len(sent_chat))
print(sent_chat[:40])

When I use the line method I get this error:

Traceback (most recent call last):
  File "C:\Python34\Lib\idlelib\testsubjects\sentencelen.py", line 57, in <module>
    text[:] = [line.rstrip('\n') for line in text]
TypeError: 'str' object does not support item assignment

Help?

Upvotes: 1

Views: 1029

Answers (2)

Raven

Reputation: 43

Actually, I discovered by accident that if you tokenize the text into words first and then into sentences, the \n disappear! Thanks for your help!
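A likely reason this works is that word-level splitting throws away whitespace, newlines included. Here is a rough sketch of that effect using plain str.split() as a stand-in for a word tokenizer. That is an assumption for illustration only, since NLTK's tokenizers differ in detail, and collapse_whitespace is a hypothetical helper name:

```python
def collapse_whitespace(text):
    """Split on any run of whitespace, then rejoin with single spaces.

    Hypothetical helper: str.split() with no arguments treats spaces,
    tabs and newlines alike and never returns empty strings.
    """
    return " ".join(text.split())


sample = (
    "hi 10-19-20sUser101 ;)\n\n\n\n"
    "I like it when you do it, 10-19-20sUser83\n\n\n"
    "iamahotnipwithpics"
)

cleaned = collapse_whitespace(sample)
print(repr(cleaned))  # no "\n" left in the result
```

Note that this also removes the line boundaries themselves, so if each chat line should stay a separate unit, splitting on the newlines is probably the better fit.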

Upvotes: 0

alvas

Reputation: 122142

>>> x = 'answers for 10-19-20sUser139 ... hi 10-19-20sUser101 ;)\n\n\n\n\n\n\n\n\n\nI like it when you do it, 10-19-20sUser83\n\n\n\n\n\n\n\n\n\n\n\niamahotnipwithpics\n\n\n\n10-19-20sUser20 go plan the wedding!'
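>>> # Swap every newline for a tab, then split on tabs and drop the empty strings left by consecutive newlines.
>>> y = "".join([i if i != "\n" else "\t" for i in x])
>>> z = [i for i in y.split('\t') if i]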
>>> z
['answers for 10-19-20sUser139 ... hi 10-19-20sUser101 ;)', 'I like it when you do it, 10-19-20sUser83', 'iamahotnipwithpics', '10-19-20sUser20 go plan the wedding!']
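As for the TypeError in the question: Python strings are immutable, so slice assignment like text[:] = ... cannot work, and iterating over a str yields single characters rather than lines. A minimal sketch of two alternatives that build new values instead, using only the standard library (the function names are hypothetical, chosen for this example):

```python
import re


def non_empty_lines(text):
    """Return the lines of text, skipping blank ones."""
    # splitlines() breaks on newlines; the filter drops the empty
    # strings produced by consecutive "\n" characters.
    return [line for line in text.splitlines() if line.strip()]


def squeeze_newlines(text):
    """Collapse each run of "\n" into a single "\n"."""
    return re.sub(r"\n+", "\n", text)


text = (
    "hi 10-19-20sUser101 ;)\n\n\n\n"
    "iamahotnipwithpics\n\n\n"
    "10-19-20sUser20 go plan the wedding!"
)

print(non_empty_lines(text))
print(repr(squeeze_newlines(text)))
```

Unlike the tab-swapping approach, this leaves any tab characters already in the text alone. Each resulting line could then be passed to sent_tokenize separately.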

Upvotes: 2
