invictus

Reputation: 1971

Force wordcloud python module to include all words

I'm using the wordcloud module in Python by Andreas Mueller to visualize the results of a survey my students will complete. It is a brilliant module that produces very nice pictures, but I have trouble making it recognize all words, even when setting stopwords=None and ranks_only=True. The survey responses are between one and three words long and may contain hyphens.

Here is an example. First I import the dependencies in my Jupyter notebook:

import matplotlib.pyplot as plt
%matplotlib inline
from wordcloud import WordCloud

Then suppose I put all the responses into a string:

words = "do do do do do do do do do do re re re re re mi mi fa fa fa fa fa fa fa fa fa fa-so fa-so fa-so fa-so fa-so so la ti do"

Then I execute the plot:

wordcloud = WordCloud(ranks_only = True,stopwords=None).generate(words)
plt.imshow(wordcloud)
plt.axis('off')
plt.show()

But for some reason it ignores "do" and "fa-so" despite their high frequency.

Any tips? Besides "don't use a word cloud". It is a silly survey and it invites a silly visualization. Thanks.

Update

I am still unable to include hyphenated words (e.g. "fa-so"); they just drop out.

Upvotes: 0

Views: 3034

Answers (1)

Looking at wordcloud.py, if the stopwords parameter is None, it falls back to the built-in STOPWORDS set, so passing None does not disable stopword filtering. Try calling it with stopwords=set().

The built-in tokenization in wordcloud.py treats a word as a run of alphanumeric characters (so fa-so gets split into fa and so), ignores case, merges simple plurals (e.g. dogs into dog), and ignores single digits. If you want to bypass this, you need to build a list of tuples, each containing a word and its frequency, then call WordCloud.generate_from_frequencies(freqs).
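
To see why the hyphenated responses vanish, here is a small standard-library sketch comparing the two tokenization styles. The pattern r"\w[\w']*" is only an approximation of the built-in behaviour described above, not necessarily the library's exact regex, and the sample text is made up for illustration.

```python
import re
from collections import Counter

text = "do re fa-so fa-so mi"

# Word-character tokens: the hyphen is not a word character,
# so "fa-so" is broken into "fa" and "so".
word_tokens = re.findall(r"\w[\w']*", text)

# Whitespace-delimited tokens: anything between spaces is one word,
# so "fa-so" survives as a single token.
whitespace_tokens = re.findall(r"\S+", text)

word_counts = Counter(word_tokens)
whitespace_counts = Counter(whitespace_tokens)

print(word_tokens)
print(whitespace_tokens)
```

With the first pattern, "fa-so" never appears as a key in the counts at all, which matches the behaviour in the question.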

I can't install wordcloud, but this simplified tokenization using \S+ (i.e. it recognizes consecutive non-whitespace characters as a word) in the wordfreq function definitely works:

import re
from operator import itemgetter

words = "do do do do do do do do do do re re re re re mi mi fa-so fa fa fa fa fa fa fa fa fa-so fa-so fa-so fa-so fa-so so la ti do"

item1 = itemgetter(1)

def wordfreq(text):
    d = {}
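    # Treat each run of non-whitespace characters as one word, so "fa-so" stays intact.
    # Alternative (word characters and apostrophes only, splits on hyphens):
    #   re.findall(r"\w[\w']*", text)
    for word in re.findall(r"\S+", text):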
        if word.isdigit():
            continue

        word_lower = word.lower()

        # Look in lowercase dict.
        if word_lower in d:
            d2 = d[word_lower]
        else:
            d2 = {}
            d[word_lower] = d2

        # Look in any case dict.
        d2[word] = d2.get(word, 0) + 1

    d3 = {}
    for d2 in d.values():
        # Get the most popular case.
        first = max(d2.items(), key=item1)[0]
        d3[first] = sum(d2.values())

    return d3.items()

freqs = wordfreq(words)

print(list(freqs))

# prints (Python 3.7+, where dicts keep insertion order):
# [('do', 11), ('re', 5), ('mi', 2), ('fa-so', 6), ('fa', 8), ('so', 1), ('la', 1), ('ti', 1)]

# pass freqs to WordCloud.generate_from_frequencies()
# maybe something like:
#   wordcloud = WordCloud(ranks_only = True,stopwords=set()).generate_from_frequencies(freqs)

You can look at the source code of wordcloud.py. You could modify it directly or, more safely and in a way that survives updates, extend or override its behaviour as in this example.
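
The frequency-building step could also be written more compactly with collections.Counter. This is a minimal sketch, not the code from the answer above: it assumes a recent wordcloud release in which generate_from_frequencies accepts a dict mapping each word to its count (older releases took a list of tuples), the helper names count_responses and render_cloud are made up for illustration, and it lowercases everything rather than keeping the most popular casing.

```python
import re
from collections import Counter

words = ("do do do do do do do do do do re re re re re mi mi fa-so "
         "fa fa fa fa fa fa fa fa fa-so fa-so fa-so fa-so fa-so so la ti do")


def count_responses(text):
    """Count whitespace-delimited tokens, ignoring case and pure-digit tokens."""
    tokens = (token.lower() for token in re.findall(r"\S+", text))
    return Counter(token for token in tokens if not token.isdigit())


def render_cloud(frequencies):
    """Build the cloud from precomputed counts (assumes wordcloud is installed)."""
    # Imported lazily so the counting part runs even without wordcloud.
    from wordcloud import WordCloud
    return WordCloud().generate_from_frequencies(dict(frequencies))


freqs = count_responses(words)
print(freqs.most_common(3))
```

In a notebook you would then call plt.imshow(render_cloud(freqs)). As far as I can tell from the wordcloud documentation, the stopwords setting is ignored when the cloud is built from frequencies, so "do" should be kept.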

Upvotes: 3
