Franziska S

Reputation: 1

Python: Only keep lines with more than 3 words if characters in words are alphabetic

Code 1 below lets me keep only the lines that contain more than 3 words. My large text document also has lines that contain non-alphabetic tokens (digits, symbols) and 3 or fewer alphabetic words, and I would like to exclude those from my cleaned list of lines as well. When I add .isalpha() in code 2, the word count no longer seems to be computed line by line. I'm new to Python and would greatly appreciate any help. For the example string below, the lines I want to keep are lines_clean = ["This is some text as an", "what I want to"]

Code 1:

import nltk
from nltk.tokenize import line_tokenize, sent_tokenize, word_tokenize
f = "This is some text as an \n example of 5\n what I want to \n achieve \n with my #$ code"
lines = line_tokenize(f)
lines_clean = []
for line in lines:
    words = word_tokenize(line)
    n_words = len(words)
    if n_words > 3:  # keep only lines with more than 3 words
        lines_clean.append(line)
print(lines_clean)

Code 2 (not working as intended):

import nltk
from nltk.tokenize import line_tokenize, sent_tokenize, word_tokenize
f = "This is some text as an \n example of 5\n what I want to \n achieve \n with my #$ code"
lines = line_tokenize(f)
lines_clean = []
alpha_only = []
for line in lines:
    words = word_tokenize(line)
    for word in words:
        if word.isalpha():
            alpha_only.append(word)
    n_words = len(alpha_only)
    if n_words > 3:
        lines_clean.append(line)
print(lines_clean)
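
In Code 2, alpha_only is created once, before the outer loop, so it keeps growing across lines and n_words ends up as a running total instead of a per-line count. The sketch below shows one possible fix: build the list of alphabetic words fresh for each line. It rests on a few assumptions that are not in the question: it uses str.splitlines() and str.split() in place of line_tokenize and word_tokenize so that it runs without NLTK data, it strips surrounding whitespace from the kept lines, and the names keep_alpha_lines and min_words are hypothetical.

```python
def keep_alpha_lines(text, min_words=3):
    """Return lines that have more than `min_words` purely alphabetic words.

    Hypothetical helper: whitespace splitting stands in for NLTK tokenization.
    """
    lines_clean = []
    for line in text.splitlines():
        # Rebuilt for every line, so the count never carries over
        # from previous lines (the problem in Code 2).
        alpha_only = [word for word in line.split() if word.isalpha()]
        if len(alpha_only) > min_words:
            lines_clean.append(line.strip())
    return lines_clean


f = "This is some text as an \n example of 5\n what I want to \n achieve \n with my #$ code"
lines_clean = keep_alpha_lines(f)
print(lines_clean)
```

To stay with NLTK, the same idea should carry over by moving alpha_only = [] inside the for line in lines: loop of Code 2, so that it is reset at the start of each line.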

Upvotes: 0

Views: 242

Answers (0)