Reputation: 268
I have a text file with some lines of text. I need to filter out all the lines that start with a lowercase letter and print only the lines that start with an uppercase letter. How do I do this in Python?
I have tried this:
filtercase = ('a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p','q','r','s','t','u','v','w','x','y','z')
out = []
ins = open("data.txt","r")
for line in ins:
    for k in filtercase:
        if(not(line.startswith(k))):
            out.append(line)
This will still print lines if they start with any lowercase letter other than 'a'. I am not sure how to change the code to make it work. Any help is appreciated.
EDITED: I have more stopword lists like these which I need to apply to the lines. So it's not just a case of upper or lower case.
Upvotes: 0
Views: 310
Reputation: 114025
Checking for lowercase can be really fast if you use the ASCII code range of the lowercase letters. With that in place, you can put all the stop words in a set (for faster lookup). This yields the following code:
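lowers = (ord('a'), ord('z'))
stopWords = set(i.lower() for i in "firstWord anotherWord".split())
out = []
with open('data.txt') as infile:
    for line in infile:
        words = line.split(None, 1)
        if not words:  # blank line: there is no first word to test
            continue
        # Compare code points, so wrap the first character in ord()
        if lowers[0] <= ord(line[0]) <= lowers[1]:
            continue
        if words[0].lower() in stopWords:
            continue
        out.append(line)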
Upvotes: 0
Reputation: 7678
Your original code iterates through every single letter in filtercase. For each letter that the line DOESN'T start with, you append the line to your out list. So every line gets appended multiple times: for a line to NOT be appended to out at all, it would have to start with 'a', 'b', 'c', and every other letter in your filter list at once, which is impossible.
Rather, you need to iterate through filtercase and look for one instance of k where line.startswith(k) is true. If the line starts with any entry in filtercase, don't append it; but if the loop gets through the entire list without the line starting with any of its elements, append it.
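To see the duplication concretely, here is a small sketch that runs the loop from the question over an in-memory list instead of data.txt. The two sample lines are made up for illustration:

```python
# Hypothetical sample lines standing in for the contents of data.txt.
lines = ["apple\n", "Banana\n"]
filtercase = tuple("abcdefghijklmnopqrstuvwxyz")

out = []
for line in lines:
    for k in filtercase:
        # Same test as in the question: append whenever ONE letter fails to match.
        if not line.startswith(k):
            out.append(line)

# "apple" is appended for every letter except 'a';
# "Banana" is appended for every letter, since none of them matches.
apple_count = out.count("apple\n")
banana_count = out.count("Banana\n")
print(apple_count, banana_count)
```

The lowercase line is not filtered out at all; it just ends up in out one time fewer than the uppercase line.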
Python's for-else syntax is very useful for checking through a list of elements:
out = []
with open('data.txt', 'r') as ins:
    for line in ins:
        for k in filtercase:
            if line.startswith(k):  # If line starts with any of the filter words
                break  # Else block isn't executed.
        else:  # Line doesn't start with filter word, append to out
            out.append(line)
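As a side note, str.startswith also accepts a tuple of prefixes, so the inner loop can be collapsed into a single call. This is only a minimal sketch, assuming filtercase is a tuple of prefixes as in the question; the helper name keep_lines and the sample lines are made up for illustration:

```python
def keep_lines(lines, prefixes):
    """Return the lines that start with none of the given prefixes."""
    prefixes = tuple(prefixes)  # startswith needs a str or a tuple, not a list or set
    return [line for line in lines if not line.startswith(prefixes)]

filtercase = tuple("abcdefghijklmnopqrstuvwxyz")
sample = ["apple\n", "Banana\n", "cherry\n", "Date\n"]  # hypothetical input
result = keep_lines(sample, filtercase)
print(result)
```

With a real file, the open file object can be passed in place of sample, e.g. keep_lines(ins, filtercase) inside the with block.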
Upvotes: 2
Reputation: 5488
This works:
fp = open("text.txt","r")
out = []
# ASCII code ranges; range() works on both Python 2 and Python 3
yesYes = range(ord('A'),ord('Z')+1)
noNo = range(ord('a'),ord('z')+1)
for line in fp:
    if len(line)>0 and ord(line[0]) in yesYes and ord(line[0]) not in noNo:
        out.append(line)
fp.close()
Or in a single line:
out = [line for line in open("text.txt","r") if len(line)>0 and ord(line[0]) in range(ord('A'),ord('Z')+1) and ord(line[0]) not in range(ord('a'),ord('z')+1)]
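A shorter check is possible with str.isupper. Treat this as a sketch rather than a drop-in equivalent: unlike the A-Z range test, isupper is also true for uppercase letters outside ASCII on Python 3 strings. The function name starts_upper and the sample lines are made up for illustration:

```python
def starts_upper(line):
    # line[:1] is '' for an empty string and ''.isupper() is False,
    # so no explicit length check is needed.
    return line[:1].isupper()

# Hypothetical lines standing in for the contents of text.txt.
sample = ["apple\n", "Banana\n", "\n", "1 item\n", "Date\n"]
out = [line for line in sample if starts_upper(line)]
print(out)
```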
Upvotes: 0
Reputation: 7821
This solution uses regular expressions and will only match lines that start with a capital letter and that do not contain any of the words in stopwords. Note that the stopword check is a substring test, so a line containing 'messenger' will not be matched if one of the stopwords is 'me'.
import re
out = []
stopwords = ['no', 'please', 'dont']
lower = re.compile('^[a-z]')
upper = re.compile('^[A-Z]')
with open('data.txt') as ifile:
    for line in ifile:
        if (not lower.match(line) and
                not any(word in line for word in stopwords)) \
                and upper.match(line):
            out.append(line)
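If whole-word matching is wanted instead of the substring test, one option is a single pattern with \b word boundaries. This is a minimal sketch under that assumption, not part of the solution above; the names stop_re and keep and the sample lines are made up for illustration, and matching stays case-sensitive:

```python
import re

stopwords = ['no', 'please', 'dont']
# \b anchors each stopword at word boundaries, so 'no' does not match inside 'knows'.
stop_re = re.compile(r'\b(?:' + '|'.join(map(re.escape, stopwords)) + r')\b')
upper = re.compile(r'[A-Z]')  # match() already anchors at the start of the string

def keep(line):
    """True if the line starts with A-Z and contains no stopword as a whole word."""
    return bool(upper.match(line)) and not stop_re.search(line)

# Hypothetical lines standing in for the contents of data.txt.
sample = ["Nobody knows\n", "Say no more\n", "lower case\n", "Hello there\n"]
out = [line for line in sample if keep(line)]
print(out)
```

With the substring test, "Nobody knows" would be rejected because 'no' occurs inside 'knows'; with word boundaries it is kept.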
Upvotes: 0
Reputation: 2382
The following approach should work.
with open('data.txt', 'r') as ins:
    # list() makes out a real list on Python 3, where filter() returns a lazy iterator
    out = list(filter(lambda line: [sw for sw in filtercase if line.startswith(sw)] == [], ins.readlines()))
Upvotes: 0