Anastasia

Reputation: 874

Is it possible to shorten this using list comprehension?

I am processing a large text file and as output I have a list of words:

['today', ',', 'is', 'cold', 'outside', '2013', '?', 'December', ...]

What I want to achieve next is to transform everything to lowercase, remove all the words that belong to a stopset (commonly used words) and remove punctuation. I can do it by doing 3 iterations:

lower=[word.lower() for word in mywords]
removepunc=[word for word in lower if word not in string.punctuation]
final=[word for word in removepunc if word not in stopset]

I tried to use

[word for word in lower if word not in string.punctuation or word not in stopset]

to achieve what the last 2 lines of code are supposed to do, but it's not working. Where is my error and is there any faster way to achieve this than to iterate through the list 3 times?

Upvotes: 0

Views: 371

Answers (8)

Javier Castellanos

Reputation: 9904

If you use filter, you can do it with a single list comprehension, and it is easier to read:

final = filter(lambda s: s not in string.punctuation and s not in stopset, [word.lower() for word in mywords])
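
One caveat, assuming Python 3: there filter returns a lazy iterator rather than a list, so the result has to be wrapped in list if a list is needed. A minimal sketch (the keep helper and the sample data are illustrative, not part of the original answer):

```python
import string

# Hypothetical sample data for illustration.
mywords = ['today', ',', 'is', 'cold', 'outside', '2013', '?', 'December']
stopset = {'is'}

def keep(word):
    """Return True for words that are neither punctuation nor stop words."""
    return word not in string.punctuation and word not in stopset

# filter() is lazy in Python 3, so wrap it in list() to materialize the result.
final = list(filter(keep, (word.lower() for word in mywords)))
print(final)  # ['today', 'cold', 'outside', '2013', 'december']
```

A named function is used instead of a lambda only for readability; the filtering logic is the same.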

Upvotes: 0

user3078690

Reputation:

is there any faster way to achieve this than to iterate through the list 3 times?

Turn jonrsharpe's code into a generator. This avoids building an intermediate list, which lowers memory use and may speed things up as well.

import string
stopset = ['is']
mywords = ['today', ',', 'is', 'cold', 'outside', '2013', '?', 'December']
# Lowercase first so the punctuation and stop-word checks see the lowercased word.
final = (word for word in (w.lower() for w in mywords)
         if word not in string.punctuation and word not in stopset)
print("final", list(final))

To display the results of a generator for debugging, wrap it in list, as in this example.

Upvotes: 0

jonrsharpe

Reputation: 122169

You can certainly compress the logic:

final = [word for word in map(str.lower, mywords)
         if word not in string.punctuation and word not in stopset]

For example, if I define stopset = ['is'] I get:

 ['today', 'cold', 'outside', '2013', 'december']
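
As for why the `or` version in the question lets everything through: "not in A or not in B" is false only for a word that is in both A and B, so a word is dropped only if it is punctuation and a stop word at the same time. Chaining two filters corresponds to `and`. A small sketch with made-up sample data:

```python
import string

# Hypothetical sample data for illustration.
mywords = ['today', ',', 'is', 'cold']
stopset = {'is'}
lower = [word.lower() for word in mywords]

# "or": a word is dropped only if it is BOTH punctuation and a stop word.
with_or = [w for w in lower if w not in string.punctuation or w not in stopset]

# "and": a word is kept only if it is NEITHER punctuation nor a stop word.
with_and = [w for w in lower if w not in string.punctuation and w not in stopset]

print(with_or)   # ['today', ',', 'is', 'cold']
print(with_and)  # ['today', 'cold']
```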

Upvotes: 1

John La Rooy

Reputation: 304503

You can use map to fold in the .lower processing

final = [word for word in map(str.lower, mywords) if word not in string.punctuation and word not in stopset]

You can simply add the characters of string.punctuation to stopset, after which it becomes

final = [word for word in map(str.lower, mywords) if word not in stopset]

Are you sure you don't want to preserve the case of the words in the output, though?
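
Both ideas in this answer can be sketched together. The name avoid and the sample data below are assumptions for illustration; the merged set only catches single-character punctuation tokens, which matches the word list in the question.

```python
import string

# Hypothetical sample data for illustration.
mywords = ['today', ',', 'is', 'cold', 'outside', '2013', '?', 'December']
stopset = {'is'}

# set(string.punctuation) turns each punctuation character into its own element.
avoid = set(stopset) | set(string.punctuation)

# Lowercased output, with a single membership test per word.
final_lower = [word for word in map(str.lower, mywords) if word not in avoid]

# Case-preserving variant: lowercase only for the membership test.
final_cased = [word for word in mywords if word.lower() not in avoid]

print(final_lower)  # ['today', 'cold', 'outside', '2013', 'december']
print(final_cased)  # ['today', 'cold', 'outside', '2013', 'December']
```

Membership tests against a set are hash lookups, so this should also be cheaper than scanning a list-based stopset for every word.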

Upvotes: 0

6502

Reputation: 114599

I'd guess the fastest approach is to move as much of the computation as possible from Python to C. First precompute the set of forbidden strings. This needs to be done just once.

avoid = set(string.punctuation) | set(x.lower() for x in stopset)

Then let the set subtraction operation do as much of the filtering as possible:

final = set(x.lower() for x in mywords) - avoid

Converting the whole source of words at once to lowercase before starting would probably improve speed too. In that case the code would be

final = set(mywords) - avoid
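
Keep in mind that the result of a set subtraction is a set, so word order and repeated words are lost. If those matter, the precomputed avoid set can still be used inside a list comprehension. A rough sketch with made-up data:

```python
import string

# Hypothetical sample data with a repeated word.
mywords = ['Cold', ',', 'is', 'cold', 'today', 'cold']
stopset = {'is'}

avoid = set(string.punctuation) | {x.lower() for x in stopset}

# Set subtraction: fast, but unordered and without duplicates.
as_set = {x.lower() for x in mywords} - avoid

# List comprehension against the same set: keeps order and duplicates.
ordered = [w for w in (x.lower() for x in mywords) if w not in avoid]

print(sorted(as_set))  # ['cold', 'today']
print(ordered)         # ['cold', 'cold', 'today', 'cold']
```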

Upvotes: 0

Guy Gavriely

Reputation: 11396

Note that list comprehensions are not the best way to go when it comes to large files, as the entire file will have to be loaded into memory.

Instead, do something like the approach in "Read large text files in Python, line by line without loading it in to memory":

with open("log.txt") as infile:
    for line in infile:
        if clause goes here:
            ....
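
One way the skeleton above might be filled in for this question is a generator that processes one line at a time. This is only a sketch: the filtered_words name is hypothetical, and splitting on whitespace is an assumption about how the file is tokenized.

```python
import io
import string

def filtered_words(lines, stopset):
    """Yield lowercased words from an iterable of lines,
    skipping punctuation tokens and stop words."""
    for line in lines:
        # Assumption: tokens are separated by whitespace.
        for token in line.split():
            word = token.lower()
            if word not in string.punctuation and word not in stopset:
                yield word

# io.StringIO stands in for a real file here; with a file on disk you would
# pass the object returned by open(...) instead.
sample = io.StringIO("Today , is cold\noutside 2013 ? December\n")
result = list(filtered_words(sample, {'is'}))
print(result)  # ['today', 'cold', 'outside', '2013', 'december']
```

Only one line is held in memory at a time, as long as the caller consumes the generator lazily instead of calling list on it.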

Upvotes: 0

Vaughn Cato

Reputation: 64308

Here is the equivalent single list comprehension, although I agree with alko that what you already have is clearer:

final = [lower_word for word in mywords for lower_word in (word.lower(),) if lower_word not in string.punctuation and lower_word not in stopset]
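
On Python 3.8 or later, an assignment expression can play the role of the one-element inner loop. This is an alternative spelling rather than what the answer proposes, and the sample data is made up:

```python
import string

# Hypothetical sample data for illustration.
mywords = ['today', ',', 'is', 'cold', 'outside', '2013', '?', 'December']
stopset = {'is'}

# (lw := word.lower()) binds the lowercased word inside the condition,
# so it can be reused in the second test and as the output element.
final = [lw for word in mywords
         if (lw := word.lower()) not in string.punctuation and lw not in stopset]

print(final)  # ['today', 'cold', 'outside', '2013', 'december']
```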

Upvotes: 0

alko

Reputation: 48397

If your code is working as intended, I don't think it's a good idea. As it stands, it is readable and can easily be modified with additional processing. One-liners are good for getting more upvotes on SO, but you'll have a hard time understanding their logic some time later.

You can replace the intermediate steps with generators instead of lists, so that the data is traversed only once and no intermediate lists are built:

lower = (word.lower() for word in mywords)
removepunc = (word for word in lower if word not in string.punctuation)
final = [word for word in removepunc if word not in stopset]

Upvotes: 2
