Reputation: 11
I am new to Python and text analytics, and I want to tokenize my text corpus:
<s> c a b c b c </s>
<s> a c b a </s>
<s> c a c a c </s>
I wanted to tokenize them into ['<s>','c','a','b','c','</s>'], but what I got is:
['<', 's', '>', 'c', 'a', 'b', 'c', 'b', 'c', '<', '/s', '>']
The "s" and "/s" are split from the "<" and ">" and end up as separate tokens. Is there any way to fix this?
Here is the code:
import nltk
# read file
with open('Text Corpus.txt', 'r') as f:
    corpus = f.read()
print(corpus)
# tokenize
tokens = nltk.word_tokenize(corpus)
print(tokens)
Upvotes: 0
Views: 508
Reputation: 20022
This looks like markup. You can use BeautifulSoup to remove it.
import nltk
from bs4 import BeautifulSoup
corpus = """
<s> c a b c b c </s>
<s> a c b a </s>
<s> c a c a c </s>
"""
print(nltk.word_tokenize(BeautifulSoup(corpus, "html.parser").get_text()))
Output:
['c', 'a', 'b', 'c', 'b', 'c', 'a', 'c', 'b', 'a', 'c', 'a', 'c', 'a', 'c']
If, however, you want to keep the tags, skip word_tokenize and split on whitespace instead:
with open("sample.txt") as f:
    corpus = f.read().split()
print(corpus)
sample.txt holds the corpus example you gave.
Output:
['<s>', 'c', 'a', 'b', 'c', 'b', 'c', '</s>', '<s>', 'a', 'c', 'b', 'a', '</s>', '<s>', 'c', 'a', 'c', 'a', 'c', '</s>']
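If the tags might sit directly next to a word (for example "<s>c"), or if you want one token list per sentence rather than one flat list, a regular expression is another option. The following is only a minimal sketch using the standard library: the names TOKEN_PATTERN and tokenize_lines are made up for illustration, and the pattern assumes the only markers in the corpus are <s> and </s>.

```python
import re

corpus = """\
<s> c a b c b c </s>
<s> a c b a </s>
<s> c a c a c </s>
"""

# Assumption: <s> and </s> are the only markers in the corpus.
# Try the markers first so they stay whole; otherwise take a run of
# characters that are neither whitespace nor angle brackets, and fall
# back to a single stray "<" or ">".
TOKEN_PATTERN = re.compile(r"</?s>|[^\s<>]+|[<>]")


def tokenize_lines(text):
    """Return one list of tokens per non-empty line of text."""
    return [TOKEN_PATTERN.findall(line) for line in text.splitlines() if line.strip()]


sentences = tokenize_lines(corpus)
print(sentences[0])
# ['<s>', 'c', 'a', 'b', 'c', 'b', 'c', '</s>']
```

The same pattern should also work with nltk.tokenize.RegexpTokenizer if you would rather stay inside NLTK, since by default that class returns the matches of the pattern you give it.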
Upvotes: 1