Reputation: 1971
I'm using Andreas Mueller's wordcloud module in Python to visualize the results of a survey my students will complete. It's a brilliant module that produces very nice pictures, but I have trouble making it recognize all words, even when setting stopwords=None and ranks_only=True. The survey responses are between one and three words long and may contain hyphens.
Here is an example. First I import the dependencies in my Jupyter notebook:
import matplotlib.pyplot as plt
%matplotlib inline
from wordcloud import WordCloud
from scipy.misc import imread
Then suppose I put all the responses into a string:
words = "do do do do do do do do do do re re re re re mi mi fa fa fa fa fa fa fa fa fa fa-so fa-so fa-so fa-so fa-so so la ti do"
Then I execute the plot:
wordcloud = WordCloud(ranks_only=True, stopwords=None).generate(words)
plt.imshow(wordcloud)
plt.axis('off')
plt.show()
But for some reason it ignores "do" and "fa-so" despite their high frequency.
Any tips? Besides "don't use a word cloud". It is a silly survey and it invites a silly visualization. Thanks.
Update
Still unable to include hyphenated words (e.g. "fa-so"); they just drop out.
Upvotes: 0
Views: 3034
Reputation: 6826
Looking at wordcloud.py, if the stopwords parameter is None, it uses the built-in STOPWORDS set, so you aren't actually suppressing stopwords. Try calling it with stopwords=set() instead.
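For example, a minimal call with an explicitly empty stopword set (using the words string from the question):

import matplotlib.pyplot as plt
from wordcloud import WordCloud

# stopwords=None falls back to the built-in STOPWORDS set (which is why "do"
# disappears); an explicitly empty set disables stopword filtering entirely.
wordcloud = WordCloud(stopwords=set()).generate(words)
plt.imshow(wordcloud)
plt.axis('off')
plt.show()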
The built-in tokenization in wordcloud.py recognizes a word as a series of alphanumeric characters, ignoring case, so fa-so gets split into fa and so; it also merges simple plurals (e.g. dogs into dog) and ignores single digits. If you want to bypass this, you need to build a list of tuples, each containing a word and its frequency, then call WordCloud.generate_from_frequencies(freqs).
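To see why the hyphen is the problem, here is a quick comparison (the \w[\w']* pattern approximates the built-in tokenizer; the exact regex may differ slightly between versions):

import re

# The alphanumeric-style pattern stops at the hyphen...
print(re.findall(r"\w[\w']*", "fa-so"))   # ['fa', 'so']
# ...while a non-whitespace pattern keeps the hyphenated word intact.
print(re.findall(r"\S+", "fa-so"))        # ['fa-so']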
I can't install wordcloud here, but the simplified tokenization below, which uses \S+ in the wordfreq function (i.e. it treats any run of consecutive non-whitespace characters as a word), definitely works:
import re
from operator import itemgetter

words = "do do do do do do do do do do re re re re re mi mi fa-so fa fa fa fa fa fa fa fa fa-so fa-so fa-so fa-so fa-so so la ti do"

item1 = itemgetter(1)

def wordfreq(text):
    # Map each lowercased word to a dict of its original-case spellings and counts.
    d = {}
    for word in re.findall(r"\S+", text):
    # for word in re.findall(r"\w[\w']*", text):
        if word.isdigit():
            continue
        word_lower = word.lower()
        # Look in lowercase dict.
        if word_lower in d:
            d2 = d[word_lower]
        else:
            d2 = {}
            d[word_lower] = d2
        # Look in any case dict.
        d2[word] = d2.get(word, 0) + 1
    # Collapse each group to its most popular spelling with the total count.
    d3 = {}
    for d2 in d.values():
        # Get the most popular case.
        first = max(d2.items(), key=item1)[0]
        d3[first] = sum(d2.values())
    return list(d3.items())

freqs = wordfreq(words)
print(freqs)
# prints (order may vary): [('do', 11), ('la', 1), ('fa-so', 6), ('mi', 2), ('fa', 8), ('so', 1), ('ti', 1), ('re', 5)]

# pass freqs to WordCloud.generate_from_frequencies()
# maybe something like:
# wordcloud = WordCloud(ranks_only=True, stopwords=set()).generate_from_frequencies(freqs)
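Putting it together with the plotting code from the question, something along these lines should work (untested sketch; depending on your wordcloud version, generate_from_frequencies expects either a list of (word, count) tuples or a dict, hence the dict(freqs) conversion for newer releases):

import matplotlib.pyplot as plt
from wordcloud import WordCloud

# Newer wordcloud releases expect a dict of word -> count; older ones took a
# list of (word, count) tuples, so adjust to match your installed version.
wc = WordCloud(stopwords=set()).generate_from_frequencies(dict(freqs))

plt.imshow(wc)
plt.axis('off')
plt.show()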
You can look at the source code of wordcloud.py and modify it directly, or, more safely and update-resistant, you can extend/modify its behaviour, for example by subclassing as sketched below.
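A minimal sketch of that subclassing route, assuming your installed version does its tokenization in a process_text method (check the source of your copy; older versions expect a list of (word, count) tuples back, newer ones a dict):

import re
from wordcloud import WordCloud

class HyphenFriendlyWordCloud(WordCloud):
    # Hypothetical subclass: swap the built-in tokenizer for a whitespace-based
    # one so hyphenated words like "fa-so" survive, and skip stopword filtering.
    def process_text(self, text):
        counts = {}
        for word in re.findall(r"\S+", text.lower()):
            if word.isdigit():
                continue
            counts[word] = counts.get(word, 0) + 1
        return counts  # older versions may want list(counts.items()) instead

# wordcloud = HyphenFriendlyWordCloud().generate(words)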
Upvotes: 3