Reputation: 874
I am processing a large text file and as output I have a list of words:
['today', ',', 'is', 'cold', 'outside', '2013', '?', 'December', ...]
What I want to achieve next is to transform everything to lowercase, remove all the words that belong to a stopset (commonly used words) and remove punctuation. I can do it by doing 3 iterations:
lower=[word.lower() for word in mywords]
removepunc=[word for word in lower if word not in string.punctuation]
final=[word for word in removepunc if word not in stopset]
I tried to use
[word for word in lower if word not in string.punctuation or word not in stopset]
to do what the last 2 lines of code are supposed to do, but it's not working. Where is my error, and is there a faster way to achieve this than iterating through the list 3 times?
Upvotes: 0
Views: 371
Reputation: 9904
If you use filter you can do it with one list comprehension and it is easier to read.
final = filter(lambda s: s not in string.punctuation and s not in stopset, [word.lower() for word in mywords])
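On Python 3, filter returns a lazy iterator rather than a list, so the result has to be wrapped in list() if a list is wanted. A minimal sketch, assuming the sample words from the question and a one-word stopset; the helper name keep is made up for illustration:

```python
import string

# Illustrative inputs; a real stopset would be much larger.
mywords = ['today', ',', 'is', 'cold', 'outside', '2013', '?', 'December']
stopset = {'is'}

def keep(word):
    # True for words that are neither punctuation nor stop words.
    return word not in string.punctuation and word not in stopset

# The generator expression lowercases lazily; list() consumes filter's iterator.
final = list(filter(keep, (word.lower() for word in mywords)))
print(final)
```

A named predicate instead of a lambda is only a readability choice here; the behaviour is the same.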
Upvotes: 0
Reputation:
is there any faster way to achieve this than to iterate through the list 3 times?
Turn johnsharpe's code into a generator expression. This avoids building intermediate lists, which lowers memory use and may also speed things up.
import string
stopset = ['is']
mywords = ['today', ',', 'is', 'cold', 'outside', '2013', '?', 'December']
# Lowercase first so that the stop-word test sees the normalised word.
final = (word for word in (w.lower() for w in mywords)
         if word not in string.punctuation and word not in stopset)
print("final", list(final))
To display the results of a generator for debugging, wrap it in list as in this example.
Upvotes: 0
Reputation: 122169
You can certainly compress the logic:
final = [word for word in map(str.lower, mywords)
         if word not in string.punctuation and word not in stopset]
For example, if I define stopset = ['is'], I get:
['today', 'cold', 'outside', '2013', 'december']
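As for why the attempt with or fails: "A or B" is true whenever either test passes. A comma is not in stopset, so it is kept, and a stop word is not punctuation, so it is kept too. A small sketch of the difference, assuming the same sample list and stopset = ['is']:

```python
import string

mywords = ['today', ',', 'is', 'cold', 'outside', '2013', '?', 'December']
stopset = ['is']  # illustrative stop list
lower = [word.lower() for word in mywords]

# "or": a word is dropped only if it is BOTH punctuation and a stop word,
# which never happens here, so nothing is filtered out.
with_or = [w for w in lower if w not in string.punctuation or w not in stopset]

# "and": a word is kept only if it passes both tests.
with_and = [w for w in lower if w not in string.punctuation and w not in stopset]

print(with_or)
print(with_and)
```

With these inputs the first list comes back unchanged, while the second drops the punctuation and the stop word.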
Upvotes: 1
Reputation: 304503
You can use map to fold in the .lower processing:
final = [word for word in map(str.lower, mywords) if word not in string.punctuation and word not in stopset]
You can simply add string.punctuation to stopset; then it becomes:
final = [word for word in map(str.lower, mywords) if word not in stopset]
Are you sure you don't want to preserve the case of the words in the output, though?
Upvotes: 0
Reputation: 114599
I'd guess the fastest approach is to move as much of the computation as possible from Python to C. First, precompute the set of forbidden strings. This needs to be done just once.
avoid = set(string.punctuation) | set(x.lower() for x in stopset)
Then let the set subtraction operation do as much of the filtering as possible:
final = set(x.lower() for x in mywords) - avoid
Converting the whole source of words at once to lowercase before starting would probably improve speed too. In that case the code would be
final = set(mywords) - avoid
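One caveat: set subtraction discards the original word order and any repeated words. If those matter, a possible variant keeps the precomputed avoid set for fast membership tests but filters with a comprehension. This is a sketch on made-up sample data; the names as_set and as_list are introduced here for illustration:

```python
import string

# Sample data with a repeated word to make the difference visible.
mywords = ['Today', ',', 'is', 'cold', 'today', '?']
stopset = ['is']

avoid = set(string.punctuation) | set(x.lower() for x in stopset)

# Set subtraction: unique words only, in no particular order.
as_set = set(x.lower() for x in mywords) - avoid

# Comprehension with set lookup: order and duplicates are preserved.
as_list = [w for w in (x.lower() for x in mywords) if w not in avoid]

print(sorted(as_set))
print(as_list)
```

Which form is preferable depends on whether the later processing needs word counts or positions.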
Upvotes: 0
Reputation: 11396
Note that list comprehensions are not the best way to go when it comes to large files, as the entire file will have to be loaded into memory.
Instead, do something like the approach in "Read large text files in Python, line by line without loading it in to memory":
with open("log.txt") as infile:
    for line in infile:
        # the filtering condition goes here
        ...
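A sketch of how the filtering could be applied while streaming the file. It assumes whitespace-separated tokens, which real text would not guarantee, and the helper name clean_words and the demo file contents are made up for illustration:

```python
import os
import string
import tempfile

def clean_words(path, stopset):
    """Yield lowercased, filtered words from the file at path, line by line."""
    avoid = set(string.punctuation) | set(w.lower() for w in stopset)
    with open(path) as infile:
        for line in infile:            # only one line is held in memory
            for word in line.split():  # naive whitespace tokenisation (assumption)
                word = word.lower()
                if word not in avoid:
                    yield word

# Demo on a throwaway file standing in for the large input.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("Today , is cold\noutside 2013 ? December\n")

result = list(clean_words(tmp.name, ["is"]))
os.remove(tmp.name)
print(result)
```

Because clean_words is a generator, a caller can also loop over it directly instead of calling list(), so the full word list never has to exist in memory.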
Upvotes: 0
Reputation: 64308
Here is the equivalent single list comprehension, although I agree with alko that what you already have is clearer:
final = [lower_word for word in mywords for lower_word in (word.lower(),) if lower_word not in string.punctuation and lower_word not in stopset]
Upvotes: 0
Reputation: 48397
If your code is working as intended, I don't think collapsing it is a good idea. As it stands it is readable and can easily be modified with additional processing. One-liners are good for getting upvotes on SO, but you'll have a hard time understanding their logic some time later.
You can replace the intermediate steps with generators instead of lists, so the data is traversed only once and no intermediate lists are built:
lower = (word.lower() for word in mywords)
removepunc = (word for word in lower if word not in string.punctuation)
final = [word for word in removepunc if word not in stopset]
Upvotes: 2