Chun Yong

Reputation: 11

Word tokenization in Python

I am new to Python and text analytics, and I want to tokenize my text corpus:

<s> c a b c b c </s>
<s> a c b a </s>
<s> c a c a c </s>

I want to tokenize the first line into ['<s>', 'c', 'a', 'b', 'c', 'b', 'c', '</s>'], but what I get is:
['<', 's', '>', 'c', 'a', 'b', 'c', 'b', 'c', '<', '/s', '>']

The "s" and "/s" are separated from "<" and ">" and returned as different tokens. Is there any way to fix this?

Here is the code:

import nltk

# Read the whole corpus file into one string
with open('Text Corpus.txt', 'r') as f:
    corpus = f.read()
print(corpus)

# Tokenize the corpus with NLTK's default word tokenizer
tokens = nltk.word_tokenize(corpus)
print(tokens)

Upvotes: 0

Views: 508

Answers (1)

baduker

Reputation: 20022

This looks like markup. If you don't need the tags, you can use BeautifulSoup to strip them before tokenizing:

import nltk

from bs4 import BeautifulSoup

corpus = """
<s> c a b c b c </s>
<s> a c b a </s>
<s> c a c a c </s>
"""

print(nltk.word_tokenize(BeautifulSoup(corpus, "html.parser").get_text()))

Output:

['c', 'a', 'b', 'c', 'b', 'c', 'a', 'c', 'b', 'a', 'c', 'a', 'c', 'a', 'c']

If, however, you want to keep the tags, skip NLTK and split the text on whitespace with str.split():

with open("sample.txt") as f:
    corpus = f.read().split()

print(corpus)

sample.txt holds the corpus example you gave.

Output:

['<s>', 'c', 'a', 'b', 'c', 'b', 'c', '</s>', '<s>', 'a', 'c', 'b', 'a', '</s>', '<s>', 'c', 'a', 'c', 'a', 'c', '</s>']
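
Note that split() only works while every tag is surrounded by whitespace, and it flattens all lines into one list. As a minimal sketch (not part of the original answer), a regular expression can keep <s> and </s> whole even when they touch a neighbouring token, and can return one token list per line. The names TOKEN_PATTERN and tokenize_keep_tags are made up for this illustration, and the pattern assumes <s> and </s> are the only tags in the corpus.

```python
import re

# Hypothetical pattern: match a whole <s> or </s> tag first,
# then any run of characters that is neither whitespace nor '<',
# and finally a stray '<' that is not part of a tag.
TOKEN_PATTERN = re.compile(r"</?s>|[^\s<]+|<")


def tokenize_keep_tags(text):
    """Return the tokens of text, keeping <s> and </s> as single tokens."""
    return TOKEN_PATTERN.findall(text)


corpus = """
<s> c a b c b c </s>
<s> a c b a </s>
<s> c a c a c </s>
"""

# One token list per non-empty line, so sentence boundaries are preserved.
sentences = [tokenize_keep_tags(line) for line in corpus.splitlines() if line.strip()]

for sentence in sentences:
    print(sentence)
```

If the corpus contains other tags or punctuation that should be split off, the pattern would need to be extended accordingly.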

Upvotes: 1
