pa1geek

Reputation: 268

Filter lines from file using stopwords in python

I have a text file with some lines of text. I need to filter out all the lines that start with a lowercase letter and print only the lines that start with an uppercase letter. How do I do this in Python?

I have tried this:

filtercase =('a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p','q','r','s','t','u','v','w','x','y','z')

out = []

ins = open("data.txt","r")
for line in ins:
   for k in filtercase:
      if(not(line.startswith(k))):
           out.append(line)

This will still print lines if they start with any lowercase letter other than 'a'. I am not sure how to change the code to make it work. Any help is appreciated.

EDITED: I have more stopword lists like these which I need to apply to the lines, so it's not just a case of upper or lower case.

Upvotes: 0

Views: 310

Answers (5)

inspectorG4dget

Reputation: 114025

Checking for lowercase can be really fast if you use the ASCII code range of the lowercase letters. With that in place, you can put all stop words in a set (for faster lookup). This yields the following code:

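lowers = (ord('a'), ord('z'))
stopWords = set(i.lower() for i in "firstWord anotherWord".split())
out = []
with open('data.txt') as infile:
    for line in infile:
        words = line.split(None, 1)
        if not words:  # blank line: there is no first word to test
            continue
        # Compare code points; comparing an int with a str fails on Python 3
        if lowers[0] <= ord(line[0]) <= lowers[1]:
            continue
        if words[0].lower() in stopWords:
            continue
        out.append(line)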
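
Since the question mentions several stopword lists, one way to extend the code above is to merge them into a single set up front. This is only a sketch: the list contents, the helper name first_word_is_stopword and the sample lines are invented for illustration, and in-memory lines stand in for data.txt.

```python
# Hypothetical stopword lists; the real ones would come from the asker.
greetings = ["hello", "hi"]
fillers = ["um", "well"]

# One lowercase set gives constant-time membership tests.
stop_words = {w.lower() for w in greetings + fillers}

def first_word_is_stopword(line):
    """True if the first whitespace-separated word is a stopword."""
    words = line.split(None, 1)
    return bool(words) and words[0].lower() in stop_words

sample = ["Hello world\n", "Well then\n", "Python rocks\n", "\n"]
out = [line for line in sample if not first_word_is_stopword(line)]
print(out)  # ['Python rocks\n', '\n']
```

Note that a first word with attached punctuation, such as "Hello,", would not match "hello" in this sketch.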

Upvotes: 0

jayelm

Reputation: 7678

Your original code iterates through every single letter in filtercase. If, for each letter, the line DOESN'T start with it, you append to your out list. But clearly, every single line would be appended multiple times, since for a line to NOT be appended to out, it must start with 'a', 'b', 'c', and every single letter in your filter list.
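
To make the duplication concrete, here is a small sketch that runs the question's loop on two in-memory lines instead of a file and counts how often each one is appended. The sample lines are made up for illustration.

```python
import string
from collections import Counter

# Same 26 lowercase letters as the question's filtercase tuple.
filtercase = tuple(string.ascii_lowercase)

# Hypothetical sample lines standing in for data.txt.
lines = ["apple\n", "Banana\n"]

out = []
for line in lines:
    for k in filtercase:
        if not line.startswith(k):  # true for every letter except at most one
            out.append(line)

counts = Counter(out)
print(counts["apple\n"])   # 25: skipped only when k == 'a'
print(counts["Banana\n"])  # 26: no lowercase letter matches
```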

Rather, you need to iterate through filtercase, and need to find one instance of k where line.startswith(k) is true. If line.startswith any phrase in filtercase, don't append it; but if it successfully iterates through the entire list without starting with any of its elements, append.

Python's for-else syntax is very useful for checking through a list of elements:

out = []

with open('data.txt', 'r') as ins:
    for line in ins:
        for k in filtercase:
            if line.startswith(k): # Line starts with one of the filter words
                break # Skip the else block, so the line is not appended
        else: # Loop finished without break: no filter word matched, so keep the line
            out.append(line)
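
As a possible shortcut, str.startswith also accepts a tuple of prefixes, so the inner loop can collapse into a single call. A minimal sketch, assuming filtercase is a tuple of strings as in the question; the helper name keep_lines and the sample lines are hypothetical.

```python
import string

def keep_lines(lines, prefixes):
    """Return the lines that start with none of the given prefixes."""
    prefixes = tuple(prefixes)  # startswith needs a tuple, not a list or set
    return [line for line in lines if not line.startswith(prefixes)]

filtercase = tuple(string.ascii_lowercase)
sample = ["apple pie\n", "Banana bread\n", "zebra\n", "Cherry\n"]
out = keep_lines(sample, filtercase)
print(out)  # ['Banana bread\n', 'Cherry\n']
```

With a real file, the open file object can be passed as lines, since iterating over it yields one line at a time.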

Upvotes: 2

Kamehameha

Reputation: 5488

This works:

out = []
# range is used instead of xrange so the code also runs on Python 3
yesYes = range(ord('A'),ord('Z')+1)
noNo = range(ord('a'),ord('z')+1)
with open("text.txt","r") as fp:
    for line in fp:
        if len(line)>0 and ord(line[0]) in yesYes and ord(line[0]) not in noNo:
            out.append(line)

Or in a single line:

out = [line for line in open("text.txt","r") if len(line)>0 and ord(line[0]) in range(ord('A'),ord('Z')+1) and ord(line[0]) not in range(ord('a'),ord('z')+1)]
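
A shorter variant of the same check, sketched under the assumption that non-ASCII capitals (for example 'É') should also count as uppercase: str.isupper on the first character replaces the ord range tests. The sample lines are hypothetical and stand in for the file.

```python
sample = ["Alpha\n", "beta\n", "Émile\n", "42 items\n", "\n"]

# line[:1] is '' for an empty string, and ''.isupper() is False,
# so no separate length check is needed.
out = [line for line in sample if line[:1].isupper()]
print(out)  # ['Alpha\n', 'Émile\n']
```

If only A-Z should pass, the ord range test above remains the stricter choice.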

Upvotes: 0

Steinar Lima

Reputation: 7821

This solution uses regular expressions and only keeps lines that start with a capital letter and do not contain any of the words in stopwords. Note that the stopword test is a plain substring check, so e.g. the line 'Home' will not be matched if one of the stopwords is 'me'.

import re

out = []
stopwords = ['no', 'please', 'dont']
lower = re.compile('^[a-z]')
upper = re.compile('^[A-Z]')
with open('data.txt') as ifile:
    for line in ifile:
        if (upper.match(line) and not lower.match(line) and
                not any(word in line for word in stopwords)):
            out.append(line)
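
The stopword check above is a substring test, so a stopword like 'no' also rejects a line containing 'Snow'. If that is not wanted, matching whole words only is one option. A sketch, assuming case-insensitive whole-word matching is the desired behavior; the combined pattern, the helper name keep and the sample lines are illustrative and not part of the answer above.

```python
import re

stopwords = ['no', 'please', 'dont']

# \b anchors each alternative at word boundaries; re.escape guards
# against stopwords that contain regex metacharacters.
stop_re = re.compile(
    r'\b(?:' + '|'.join(re.escape(w) for w in stopwords) + r')\b',
    re.IGNORECASE,
)
upper = re.compile(r'[A-Z]')

def keep(line):
    """Keep lines that start with A-Z and contain no whole-word stopword."""
    return bool(upper.match(line)) and not stop_re.search(line)

sample = ["Snow falls\n", "Say no more\n", "Please wait\n", "lowercase\n"]
result = [line for line in sample if keep(line)]
print(result)  # ['Snow falls\n']
```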

Upvotes: 0

Sunny Nanda

Reputation: 2382

The following approach should work.

with open('data.txt', 'r') as ins:
    # list() forces evaluation, since filter returns a lazy iterator on Python 3
    out = list(filter(lambda line: not any(line.startswith(sw) for sw in filtercase), ins))

Upvotes: 0
