Reputation: 1
Code 1 below lets me keep only lines with more than 3 words. My large text document also has lines that contain non-alphabetic characters and have 3 or fewer actual words, and I would like to exclude those from my cleaned list of lines as well. When I use .isalpha() in Code 2, the word count no longer seems to be done line by line. I'm new to Python and would greatly appreciate any help. The lines I want to keep are: lines_clean = ["This is some text as an", "what I want to"]
Code 1:
import nltk
from nltk.tokenize import line_tokenize, sent_tokenize, word_tokenize

f = "This is some text as an \n example of 5\n what I want to \n achieve \n with my #$ code"
lines = line_tokenize(f)
lines_clean = []
for line in lines:
    words = word_tokenize(line)
    n_words = len(words)
    if n_words >= 3:
        lines_clean.append(line)
print(lines_clean)
Code 2 (not working as intended):
import nltk
from nltk.tokenize import line_tokenize, sent_tokenize, word_tokenize

f = "This is some text as an \n example of 5\n what I want to \n achieve \n with my #$ code"
lines = line_tokenize(f)
lines_clean = []
alpha_only = []
for line in lines:
    words = word_tokenize(line)
    for word in words:
        if word.isalpha():
            alpha_only.append(word)
    n_words = len(alpha_only)
    if n_words >= 3:
        lines_clean.append(line)
print(lines_clean)
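
A possible explanation, offered as a sketch rather than a definitive answer: in Code 2 the list alpha_only is created once, before the loop, so it keeps growing across lines and len(alpha_only) becomes a running total instead of a per-line count. The version below rebuilds the list for every line. It makes a few assumptions that are not in the original code: it uses plain str.splitlines() and str.split() in place of line_tokenize and word_tokenize (so it runs without NLTK data), the helper name keep_wordy_lines and its min_words parameter are hypothetical, the threshold is set to 4 to match "more than 3 words", and kept lines are stripped of surrounding whitespace to match the desired output.

```python
text = "This is some text as an \n example of 5\n what I want to \n achieve \n with my #$ code"


def keep_wordy_lines(text, min_words=4):
    """Keep lines that contain at least `min_words` purely alphabetic tokens.

    Hypothetical helper: whitespace splitting stands in for NLTK's tokenizers.
    """
    kept = []
    for line in text.splitlines():
        # Rebuilt on every iteration, so the count applies to this line only.
        alpha_words = [w for w in line.split() if w.isalpha()]
        if len(alpha_words) >= min_words:
            kept.append(line.strip())
    return kept


lines_clean = keep_wordy_lines(text)
print(lines_clean)  # ['This is some text as an', 'what I want to']
```

The same idea should carry over to the NLTK version: moving alpha_only = [] inside the outer for loop resets the count for each line.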
Upvotes: 0
Views: 242