李恒通

Reputation: 33

Why doesn't NLTK wn.all_synsets() function in wordnet return a list of synsets?

I am writing code for this exercise:

What percentage of noun synsets have no hyponyms? You can get all noun synsets using wn.all_synsets('n').

Here is my code:

import nltk
from nltk.corpus import wordnet as wn

all_noun = wn.all_synsets('n')
print(all_noun)
print(wn.all_synsets('n'))
all_num = len(set(all_noun))
noun_have_hypon = [word for word in wn.all_synsets('n') if len(word.hyponyms()) >= 1]
noun_have_num = len(noun_have_hypon)
print('There are %d nouns, and %d nouns without hyponyms, the percentage is %f' %
  (all_num, noun_have_num, (all_num-noun_have_num)/all_num*100))

When I run this code, the output is:

<generator object all_synsets at 0x10927b1b0>

<generator object all_synsets at 0x10e6f0bd0>

There are 82115 nouns, and 16693 nouns without hyponyms, the percentage is 79.671193

But if I change

noun_have_hypon = [word for word in wn.all_synsets('n') if len(word.hyponyms()) >= 1]

to

noun_have_hypon = [word for word in all_noun if len(word.hyponyms()) >= 1]

the output changes to

<generator object all_synsets at 0x10917b1b0>

<generator object all_synsets at 0x10e46aab0>

There are 82115 nouns, and 0 nouns without hyponyms, the percentage is 100.000000

Why do the two answers differ even though all_noun = wn.all_synsets('n'), and what do 0x10927b1b0 and 0x10e6f0bd0 mean?

Upvotes: 3

Views: 2732

Answers (1)

alvas

Reputation: 122270

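This has little to do with NLTK; it comes down to the difference between generators and lists. wn.all_synsets() returns a generator, which can be iterated over only once. In your second version, len(set(all_noun)) consumes every item in all_noun, so the list comprehension that loops over all_noun afterwards finds nothing left and produces 0 matches. In your first version, the comprehension calls wn.all_synsets('n') again and therefore gets a fresh generator.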

Let's go through a small example:

First, let's create a function that returns a simple list:

>>> def some_func_that_returns_a_list():
...     list_to_be_returned = []
...     for i in range(10):
...             list_to_be_returned.append(i)
...     return list_to_be_returned
... 
>>> some_func_that_returns_a_list()
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

Note that in some_func_that_returns_a_list(), the list has to be created and filled with values before the function returns to the code that called it.

Similarly, we can use a generator to produce the same values, but it works a little differently because it uses the yield keyword:

>>> def some_func_that_returns_a_generator():
...     for i in range(10):
...             yield i
... 
>>> 

Note that the function never instantiates a list to be returned.

And when you try to call the function:

>>> some_func_that_returns_a_generator()
<generator object some_func_that_returns_a_generator at 0x7f312719a780>

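You get back the string representation of a generator object, i.e. just something that describes the object. The hexadecimal number (like the 0x10927b1b0 in your output) is the object's memory address in CPython, which is why it differs between calls and runs. At this point no values have been instantiated, so the generator object should be smaller than the list returned by the other function (the exact sizes depend on your Python version):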

>>> import sys
>>> sys.getsizeof(some_func_that_returns_a_generator())
80
>>> sys.getsizeof(some_func_that_returns_a_list())
200
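
As a rough sketch of why the sizes differ (exact byte counts depend on the Python version and platform, so none are shown here): a list's footprint grows with the number of items it holds, while a generator object only keeps its paused state. The helper names make_list and make_generator are made up for this illustration.

```python
import sys

def make_list(n):
    # Builds all n values up front and returns them in a list.
    return [i for i in range(n)]

def make_generator(n):
    # Produces the same values lazily, one per request.
    for i in range(n):
        yield i

small_list_size = sys.getsizeof(make_list(10))
big_list_size = sys.getsizeof(make_list(100000))
big_gen_size = sys.getsizeof(make_generator(100000))

# The list container grows with the number of items it stores.
print(small_list_size < big_list_size)
# The generator object stays far smaller than the large list.
print(big_gen_size < big_list_size)
```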

Since a generator doesn't instantiate the values up front and just yields the items one at a time, you need to loop through the generator explicitly to get a list, e.g.:

>>> list(some_func_that_returns_a_generator())
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> [i for i in some_func_that_returns_a_generator()]
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

But in this case the list is created "on-the-fly". If you're not going to store the list and only want to read the elements out one at a time, a generator is advantageous (memory-wise).
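
The catch is that a generator can be consumed only once. Here is a minimal sketch of what appears to have happened in the question, using a toy generator (count_up_to is a made-up name) in place of wn.all_synsets('n'):

```python
def count_up_to(n):
    # Toy stand-in for any function that returns a generator.
    for i in range(n):
        yield i

numbers = count_up_to(5)
first_pass = len(set(numbers))       # consumes every item in the generator
second_pass = [i for i in numbers]   # the same generator is now empty

# Calling the function again creates a brand-new generator.
fresh = [i for i in count_up_to(5)]

print(first_pass)   # 5
print(second_pass)  # []
print(fresh)        # [0, 1, 2, 3, 4]
```

This mirrors the question: len(set(all_noun)) plays the role of first_pass, and the later loop over all_noun plays the role of second_pass.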


So in the case of NLTK wn.all_synsets() WordNet API, you can simply do something like:

>>> from nltk.corpus import wordnet as wn
>>> nouns_in_wordnet = list(wn.all_synsets('n'))

But do note that this keeps the whole list of noun Synsets in memory.

And if you want to filter the nouns with at least one hypernym, you can avoid instantiating a full list of nouns by using the filter() function (in Python 3 it returns a lazy iterator rather than a list):

>>> filter(lambda ss: len(ss.hypernyms()) > 0, wn.all_synsets('n'))
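
A small sketch of how filter() behaves, assuming Python 3 (in Python 2 it returned a list directly). The numbers here are a toy stand-in for the Synsets:

```python
numbers = (i for i in range(10))
odd_numbers = filter(lambda i: i % 2 == 1, numbers)

# In Python 3 a filter object is lazy and has no length.
try:
    len(odd_numbers)
    has_len = True
except TypeError:
    has_len = False

# Wrapping it in list() materialises the matching items.
odd_list = list(odd_numbers)

print(has_len)   # False
print(odd_list)  # [1, 3, 5, 7, 9]
```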

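To count them, note that in Python 3 a filter object has no len(), so it must be wrapped in list() first, which does build the filtered list in memory:

>>> len(list(filter(lambda ss: len(ss.hypernyms()) > 0, wn.all_synsets('n'))))
74389

or, less verbosely and truly "on-the-fly" without storing the Synsets in memory: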

>>> sum(1 for ss in wn.all_synsets('n') if len(ss.hypernyms()) > 0)
74389

But most likely, you would like to access the Synsets, so you might be looking for this (with filter() wrapped in list() so that len() works in Python 3):

>>> nouns_with_hyper = list(filter(lambda ss: len(ss.hypernyms()) > 0, wn.all_synsets('n')))
>>> len(nouns_with_hyper)
74389
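
Coming back to the original exercise (the percentage of noun synsets with no hyponyms), one way to avoid the exhausted-generator problem is to count everything in a single pass. The sketch below is only an illustration: percent_without_hyponyms and FakeSynset are hypothetical names, and FakeSynset is a stand-in object so that the example runs without WordNet installed.

```python
def percent_without_hyponyms(synsets):
    # Walk the iterable exactly once, so a generator works fine here.
    total = 0
    without = 0
    for synset in synsets:
        total += 1
        if not synset.hyponyms():
            without += 1
    if total == 0:
        return 0, 0, 0.0
    return total, without, 100.0 * without / total

class FakeSynset:
    # Hypothetical stand-in exposing only the hyponyms() method used above.
    def __init__(self, hyponyms):
        self._hyponyms = hyponyms

    def hyponyms(self):
        return self._hyponyms

leaf = FakeSynset([])
parent = FakeSynset([leaf])
demo = (s for s in [parent, leaf, FakeSynset([]), FakeSynset([])])

total, without, percent = percent_without_hyponyms(demo)
print(total, without, percent)  # 4 3 75.0
```

With NLTK and the WordNet data installed, the same function should accept the real generator directly: percent_without_hyponyms(wn.all_synsets('n')).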

Upvotes: 6
