Reputation: 11
I wanted to see if I could generate a word cloud from my fake dataframe, but I'm running into quite a bit of trouble. I used the code from this website: https://www.analyticsvidhya.com/blog/2020/04/beginners-guide-exploratory-data-analysis-text-data/
This is the code I have thus far:
Making the dataframe
import pandas as pd
import numpy as np
import json
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation
s1 = 'The fox and the hound walk through the woods together and find a bunny'
s2 = 'The closet is made of glass'
s3 = 'Cheetahs are one of the fastest animals in the world'
s4 = 'Vincent van Gogh was a phenomenal artist'
s5 = 'Once upon a time there was an evil queen who cast a glorious curse'
s6 = 'Emma and Regina would have been a perfect power couple'
s7 = 'The iphones camera is way worse than the android camera'
s8 = 'who even wears only white socks? Thats boring'
s9 = 'Ebby is the most precious dog in the whole world'
s10 = 'Birds go chirp chirp chirp'
viralc = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0]
df = pd.DataFrame({'titles':[s1, s2, s3, s4, s5, s6, s7, s8, s9, s10], 'viral':viralc})
df
Using group_by and turning into tf-idf
df_grouped=df[['titles', 'viral']].groupby(by='viral').agg(lambda x:' '.join(x))
df_grouped.head()
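The post does not show how df_dtm (used further down) is built from the grouped titles. As a rough stand-in, and assuming plain word counts rather than TF-IDF, per-group frequencies could be computed with the standard library alone. The helper name group_frequencies and the shortened sample data are hypothetical; the result is a word-to-count dict with string keys, which is the shape WordCloud.generate_from_frequencies accepts.

```python
from collections import Counter, defaultdict

# Shortened sample mirroring the toy data above (three of the ten titles).
titles = [
    "The closet is made of glass",
    "Birds go chirp chirp chirp",
    "Vincent van Gogh was a phenomenal artist",
]
viral = [0, 0, 1]


def group_frequencies(titles, labels):
    """Return {label_as_str: {word: count}} using lowercase, whitespace-split words."""
    groups = defaultdict(Counter)
    for title, label in zip(titles, labels):
        # str(label) keeps every key a string, whatever type the labels have.
        groups[str(label)].update(title.lower().split())
    return {label: dict(counts) for label, counts in groups.items()}


freqs = group_frequencies(titles, viral)
print(freqs["0"]["chirp"])  # 3
```

Each inner dict (for example freqs["0"]) could then be passed to generate_from_frequencies in place of data.to_dict().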
Attempting (and miserably failing) to generate a wordcloud:
# Importing wordcloud for plotting word clouds and textwrap for wrapping longer text
import matplotlib.pyplot as plt
from wordcloud import WordCloud
from textwrap import wrap
# Function for generating word clouds
def generate_wordcloud(data,title):
    wc = WordCloud(width=400, height=330, max_words=150,colormap="Dark2", font_path='C:\\Users\\Romy\\Documents\\Studie\\DataScience_Master\\Thesis\\Fonts\\arial.ttf').generate_from_frequencies(data.to_dict())
    plt.figure(figsize=(10,8))
    plt.imshow(wc, interpolation='bilinear')
    plt.axis("off")
    plt.title('\n'.join(wrap(title,60)),fontsize=13)
    plt.show()
# Transposing document term matrix
df_dtm=df_dtm.transpose()
df_dtm
# Plotting word cloud for each product
for index,product in enumerate(df_dtm.columns):
    generate_wordcloud(df_dtm[product].sort_values(ascending=False), product)
This gives me either an error saying that wordcloud only supports TrueType fonts (which the font I put in is), or this error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Input In [106], in <cell line: 18>()
17 df_dtm# Plotting word cloud for each product
18 for index,product in enumerate(df_dtm.columns):
---> 19 generate_wordcloud(df_dtm[product].sort_values(ascending=False), product)
Input In [106], in generate_wordcloud(data, title)
6 def generate_wordcloud(data,title):
----> 7 wc = WordCloud(width=400, height=330, max_words=150,colormap="Dark2", font_path='C:\\Users\\Romy\\Documents\\Studie\\DataScience_Master\\Thesis\\Fonts\\arial.ttf').generate_from_frequencies(data.to_dict())
8 plt.figure(figsize=(10,8))
9 plt.imshow(wc, interpolation='bilinear')
File ~\anaconda3\lib\site-packages\wordcloud\wordcloud.py:453, in WordCloud.generate_from_frequencies(self, frequencies, max_font_size)
451 font_size = self.height
452 else:
--> 453 self.generate_from_frequencies(dict(frequencies[:2]),
454 max_font_size=self.height)
455 # find font sizes
456 sizes = [x[1] for x in self.layout_]
File ~\anaconda3\lib\site-packages\wordcloud\wordcloud.py:508, in WordCloud.generate_from_frequencies(self, frequencies, max_font_size)
505 transposed_font = ImageFont.TransposedFont(
506 font, orientation=orientation)
507 # get size of resulting text
--> 508 box_size = draw.textbbox((0, 0), word, font=transposed_font, anchor="lt")
509 # find possible places using integral image:
510 result = occupancy.sample_position(box_size[3] + self.margin,
511 box_size[2] + self.margin,
512 random_state)
File ~\anaconda3\lib\site-packages\PIL\ImageDraw.py:653, in ImageDraw.textbbox(self, xy, text, font, anchor, spacing, align, direction, features, language, stroke_width, embedded_color)
650 if embedded_color and self.mode not in ("RGB", "RGBA"):
651 raise ValueError("Embedded color supported only in RGB and RGBA modes")
--> 653 if self._multiline_check(text):
654 return self.multiline_textbbox(
655 xy,
656 text,
(...)
665 embedded_color,
666 )
668 if font is None:
File ~\anaconda3\lib\site-packages\PIL\ImageDraw.py:368, in ImageDraw._multiline_check(self, text)
365 """Draw text."""
366 split_character = "\n" if isinstance(text, str) else b"\n"
--> 368 return split_character in text
TypeError: argument of type 'int' is not iterable
There are a few things I must mention:
It says there's a newer version of pip available, but when I run the code it suggests, it doesn't install the newer version. And I don't know if this newer version is needed to use wordcloud.
It also says there is a weird (weird is not the right word, but basically it's not supposed to be like that) package installed called 'illow', which I am guessing is supposed to be pillow. I know you do need pillow, but it seems that pillow is actually installed, namely version 9.5.0.
I then figured that maybe I could try running my code on our school's GPU, since I will need to anyway for the final code, and maybe installing things there would be easier (spoiler alert: it was not). I ran this command to install wordcloud in my environment (as instructed by the Anaconda website, which I have used for installing other packages as well):
conda install -c conda-forge wordcloud
but I got this error:
> Retrieving notices: ...working... done Collecting package metadata
> (current_repodata.json): done Solving environment: failed with initial
> frozen solve. Retrying with flexible solve. Solving environment:
> failed with repodata from current_repodata.json, will retry with next
> repodata source. Collecting package metadata (repodata.json): done
> Solving environment: failed with initial frozen solve. Retrying with
> flexible solve. Solving environment: - Found conflicts! Looking for
> incompatible packages. This can take several minutes. Press CTRL-C to
> abort. failed
>
> UnsatisfiableError: The following specifications were found to be
> incompatible with the existing python installation in your
> environment:
>
> Specifications:
>
> - wordcloud -> python[version='2.7.*|3.5.*|3.6.*|>=2.7,<2.8.0a0|>=3.10,<3.11.0a0|>=3.8,<3.9.0a0|>=3.9,<3.10.0a0|>=3.7,<3.8.0a0|>=3.6,<3.7.0a0|3.4.*']
>
> Your python: python=3.11
>
> If python is on the left-most side of the chain, that's the version
> you've asked for. When python appears to the right, that indicates
> that the thing on the left is somehow not available for the python
> version you are constrained to. Note that conda will not change your
> python version to a different minor version unless you explicitly
> specify that.
>
> The following specifications were found to be incompatible with your
> system:
>
> - feature:/linux-64::__glibc==2.31=0
> - python=3.11 -> libgcc-ng[version='>=11.2.0'] -> __glibc[version='>=2.17']
> - wordcloud -> libgcc-ng[version='>=9.4.0'] -> __glibc[version='>=2.17']
>
> Your installed version is: 2.31
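The python[version='...'] part of the message can be read as a list of alternatives separated by |, each made of comma-separated constraints. The following is only a rough sketch of that reading, not conda's actual matcher: the helper names are hypothetical, and pre-release suffixes such as a0 are simply ignored.

```python
import re

# Spec string copied from the conda error message above.
SPEC = ("2.7.*|3.5.*|3.6.*|>=2.7,<2.8.0a0|>=3.10,<3.11.0a0|>=3.8,<3.9.0a0|"
        ">=3.9,<3.10.0a0|>=3.7,<3.8.0a0|>=3.6,<3.7.0a0|3.4.*")


def parse(version):
    """Turn '3.11.0a0' into (3, 11, 0), dropping any pre-release suffix."""
    return tuple(int(re.match(r"\d+", part).group()) for part in version.split("."))


def clause_ok(version, clause):
    """Check one constraint such as '3.6.*', '>=3.10' or '<3.11.0a0'."""
    if clause.endswith(".*"):
        prefix = parse(clause[:-2])
        return version[:len(prefix)] == prefix
    if clause.startswith(">="):
        return version >= parse(clause[2:])
    if clause.startswith("<"):
        return version < parse(clause[1:])
    raise ValueError(f"unsupported clause: {clause}")


def matches(version, spec=SPEC):
    """True if the (major, minor, patch) tuple satisfies any alternative in spec."""
    return any(
        all(clause_ok(version, clause) for clause in alternative.split(","))
        for alternative in spec.split("|")
    )


print(matches((3, 11, 0)))  # False
print(matches((3, 10, 0)))  # True
```

Under this reading, none of the wordcloud builds the solver found accepts Python 3.11, while an environment on Python 3.10 or older would satisfy this particular spec.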
And... well, I am doing a data science master's, but they have never taught us anything about computer science or how installing things works. I am lost on where to even begin solving this, and on whether it is even needed to use wordcloud.
Does anyone have any idea what I can do?
Upvotes: 0
Views: 1161
Reputation: 16
For the viral labels that you have defined, use strings instead of integer values. Replace your declaration of the viralc list with this:
viralc = ["1", "0", "1", "1", "0", "1", "0", "1", "1", "0"]
I am getting these word clouds for viral = 0 and 1.
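The traceback in the question ends inside Pillow's _multiline_check, where split_character in text is evaluated on an integer, so at least one of the words handed to generate_from_frequencies is an int. As a hedged sketch (the helper name stringify_keys is hypothetical and not part of wordcloud), the keys could also be coerced to strings just before plotting:

```python
def stringify_keys(frequencies):
    """Return a copy of the word -> weight mapping whose keys are all str."""
    return {str(word): weight for word, weight in frequencies.items()}


# An int "word" fails the same membership test that appears in the traceback.
word = 0
try:
    "\n" in word
    failed = False
except TypeError:
    failed = True
print(failed)  # True

# Made-up frequencies with one integer key, similar to an integer label.
bad = {0: 3.0, "fox": 1.0}
good = stringify_keys(bad)
print(good)  # {'0': 3.0, 'fox': 1.0}
```

In the question's function this would mean calling generate_from_frequencies(stringify_keys(data.to_dict())). Converting the column once with df['viral'] = df['viral'].astype(str) should have the same effect as retyping the list by hand.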
Upvotes: 0