Screamcheese

Reputation: 139

Python nested dictionary comprehension with file objects

Fairly new to Python. I'm trying to make my code more elegant by collapsing a nested for loop into a comprehension that ultimately creates a dictionary, where the keys are the words in a file and the values are their frequencies. I believe I figured out how to do the inner for loop using a dictionary comprehension, but I am having trouble with the syntax for the outer for loop. I am guessing the outer for loop would be set up as a list comprehension expression. For now I am not worrying about what counts as a word (symbols, numbers, letters), and I am trying to avoid importing any additional libraries. Could you show me some examples, or point me to a resource where I could read more about nested/advanced comprehensions?

The "brute force" approach I originally wrote looks something like this:

word_cache = {}
# Some code here

with open('myfile.txt') as lines:
    for line in lines:
        for word in line.split():
            word_cache[word]=word_cache.get(word,0)+1


    '''
    Below is what I have as a dictionary comprehension instead.
    The "for line in lines" part is what I am having difficulty nesting; I believe it would supply the "line" used in the dictionary comprehension. Part of the issue I see is that lines is a file object.
    '''
    word_cache.update({word:word_cache.get(word,0)+1 for word in line.split()})

    # Tried the below, but it did not work because (line for line in lines) is a generator expression, which has no split() method
    word_cache.update({word:word_cache.get(word,0)+1 for word in (line for line in lines).split()})

Could someone help me understand the correct syntax for nested comprehensions over file objects (assuming the file object comes from a txt file)?

Upvotes: 0

Views: 244

Answers (2)

aydow

Reputation: 3801

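A comprehension won't work in this case because you are relying on the container referencing itself while it is being built. If word_cache hasn't been defined yet, you will get a NameError; if it has, every lookup sees the dict as it was before the comprehension started.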

Your original code is something like this

# initialising the dict
cache = {}

with open('myfile.txt') as lines:
    for line in lines:
        for word in line.split():
            # referencing the dict that has been initialised
            cache[word] = cache.get(word, 0) + 1

What you might want to try is something like this

with open('myfile.txt') as lines:
    word_cache = {word: word_cache.get(word, 0) + 1 for line in lines for word in line.split()}            

This won't work because comprehensions create the object first and then perform assignment second. Therefore, when you use word_cache.get, Python has no idea what you're referring to as word_cache hasn't been created yet!

e.g.

In [1]: a = [a[0] + i for i in range(3)]
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-1-3186711e5c1b> in <module>
----> 1 a = [a[0] + i for i in range(3)]

<ipython-input-1-3186711e5c1b> in <listcomp>(.0)
----> 1 a = [a[0] + i for i in range(3)]

NameError: name 'a' is not defined
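
If word_cache already exists as an empty dict, as in the question, there is no NameError, but the result is still wrong: every count comes out as 1. A minimal sketch of that behaviour, with io.StringIO standing in for the opened file and made-up sample text:

```python
import io

# Stand-in for open('myfile.txt'); the sample text is made up for illustration.
lines = io.StringIO("a b a\nb a\n")

word_cache = {}
# The comprehension is evaluated in full before update() runs, so every
# word_cache.get(word, 0) call sees the still-empty dict and returns 0.
word_cache.update(
    {word: word_cache.get(word, 0) + 1 for line in lines for word in line.split()}
)
print(word_cache)  # {'a': 1, 'b': 1}
```

The word 'a' appears three times in the sample, yet it is recorded once, because repeated keys in the comprehension simply overwrite each other with the same value.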

Consider using a Counter from collections.

In [1]: from collections import Counter

In [2]: with open('/path/to.file') as f:
   ...:     cache = Counter(f.read().split())
   ...:

It's important to use the right tools for the job. In this case, it's a Counter.
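
If you would rather iterate line by line than call f.read(), a Counter also accepts a generator expression, which is where the nested for clauses from the question fit naturally. A minimal sketch, again with io.StringIO standing in for the opened file and made-up sample text:

```python
import io
from collections import Counter

# Stand-in for open('/path/to.file'); the sample text is made up for illustration.
lines = io.StringIO("the cat\nthe dog\n")

# Outer clause iterates over lines, inner clause over the words of each line.
counts = Counter(word for line in lines for word in line.split())
print(counts["the"])  # 2
```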

More importantly, who is saying that your initial solution is not elegant or straightforward? A comprehension doesn't make a solution more elegant or readable.

An example of this (as you mention that you don't want to use additional libraries) might be

In [24]: with open('/path/to.file') as f:
    ...:     words = f.read().split()
    ...:     cache = {word: words.count(word) for word in set(words)}
    ...:

But to me this is not elegant and it's going to be much slower. I know speed is not exactly what you asked about but the difference is instructive and shouldn't be ignored.

In [24]: with open('/path/to.file') as f:
    ...:     words = f.read().split()
    ...:

In [25]: %%timeit cache = {}
    ...: for word in words:
    ...:     cache[word] = cache.get(word, 0) + 1
    ...:
26.7 µs ± 802 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [26]: %timeit cache = Counter(words)
12.9 µs ± 703 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [27]: %timeit cache = {word: words.count(word) for word in set(words)}
451 µs ± 60.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [28]: len(words)
Out[28]: 245

As you can see, Counter is the fastest, with the regular for loop taking about twice as long. The comprehension is roughly 35 times slower than Counter, and this is for a word list of only 245 words. The reason is that words.count(word) scans the whole word list once for every distinct word.

This might not be the best comprehension we can come up with, but you can now see that just because you can, doesn't mean you should. It's neither fast nor elegant.

To further demonstrate how bad this solution is, we can increase the file size.

In [30]: with open('/Downloads/sample-2mb-text-file.txt') as f:
    ...:     words = f.read().split()
    ...:

In [31]: len(words)
Out[31]: 322392

In [32]: %%timeit cache = {}
    ...: for word in words:
    ...:     cache[word] = cache.get(word, 0) + 1
    ...:
51 ms ± 6.25 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [33]: %timeit cache = Counter(words)
26.6 ms ± 1.25 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [34]: %timeit cache = {word: words.count(word) for word in set(words)}
2.54 s ± 72.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

You can see here that the Counter and for loop solutions scale relative to one another (as before, Counter is about twice as fast). Our dict comprehension on the other hand is now nearly 100 times slower compared to Counter.

Upvotes: 0

Jacques Gaudin

Reputation: 17008

Just put the for loops one after another, outermost loop first:

{word: word_cache.get(word, 0) + 1 for line in lines for word in line.split()}

See the last example of PEP 274

As mentioned in the comments, a comprehension doesn't really help in your context as the assignment will only occur when the loops are completed.

Comprehensions are sometimes very useful, but they're only syntactic sugar. Nothing that a plain for loop can't do.
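
To illustrate the clause order, here is a minimal sketch comparing a nested for loop with the equivalent comprehension. It only flattens the words into a list rather than counting them, and io.StringIO with made-up text stands in for the file:

```python
import io

# Made-up sample text; io.StringIO behaves like an opened text file here.
text = "one two\nthree\n"

# Nested loops: outer loop over lines, inner loop over words.
flat_loop = []
for line in io.StringIO(text):
    for word in line.split():
        flat_loop.append(word)

# Same clauses in the same left-to-right order as the loops above.
flat_comp = [word for line in io.StringIO(text) for word in line.split()]

print(flat_comp)  # ['one', 'two', 'three']
```

Writing the clauses the other way round (for word in line.split() for line in lines) fails, because line is used before the clause that defines it.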

Upvotes: 0
