Reputation: 3294
For my term paper in the CAS Basic Data Science semester I'm scraping a news website and collecting all articles with their metadata (author, title, subtitle, summary, tags, category, subcategory, creation dt, update dt, etc.), inspired by this video: https://www.youtube.com/watch?v=-YpwsdRKt8Q
Everything works quite well; my Raspberry Pi has been collecting this data every 15 minutes.
I just have one problem: I'd like to create a correlation network out of the tags. The tag column looks like this:
0     panorama,schweiz,verkehr,news
1     sport,schweiz,eishockey,news
2     stans,panorama,verkehr,strassenverkehr,news
3     eishockey,sport,davos,news
4     wirtschaft,schweiz,konsum,kaffeetee,news
5     jeanclaudegerber,news,srilanka,tiere,wissen
6     schule,bellinzona,panorama,news
7     luzern,jrgenklopp,fussball,news
8     panorama,klima,gretathunberg,lissabon,news
9     australien,vermisstmeldung,gesellschaft,news
10    gesellschaft,amerika,news,ausstellung
Now I want to calculate the correlations between the tags. For example, in the first row "panorama" has one link to each of "schweiz", "verkehr" and "news"; "schweiz" has one link to each of "panorama", "verkehr" and "news"; and so on. Sometimes a row has 3 tags, sometimes up to 7 or 8.
I'd like a script that runs through all the rows, calculates these connections and sums them up.
First question: could someone give me a hint on how to do this? Are there modules that could help? I would be grateful even for a small hint.
And the last question: could someone also give me a hint on how to visualize this? I'd like a network plot where I can see the whole map, with the most common tags drawn bigger and the most common connections drawn with thicker lines.
My main problem is that I don't even know what to search for. You probably noticed that English is not my native language, and in German I haven't found anything that really helped me ;-)
Thanks a lot and cheers from Switzerland, marco
Edit/PS: to be more precise, every entry in the list is a tag. So if I have:
panorama,schweiz,verkehr,news
These are 4 tags and everyone of it is related to the other three ones.
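To make clear what kind of counting I mean, here is a rough sketch of the idea (the rows list below is just a couple of example rows from the column above):

from collections import Counter
from itertools import combinations

# a couple of example rows from the tag column above
rows = [
    'panorama,schweiz,verkehr,news',
    'sport,schweiz,eishockey,news',
]

pair_counts = Counter()
for row in rows:
    # every tag in a row gets one link to every other tag in the same row
    for pair in combinations(sorted(row.split(',')), 2):
        pair_counts[pair] += 1

print(pair_counts.most_common())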
Upvotes: 0
Views: 151
Reputation: 1843
I think the first thing you'll want to do is count the occurrence of each category for each tag, so start with a Pandas DataFrame with the tags as the index:
df =

                  1                2              3                4
tags
panorama          schweiz          verkehr        news             None
sport             schweiz          eishockey      news             None
stans             panorama         verkehr        strassenverkehr  news
eishockey         sport            davos          news             None
wirtschaft        schweiz          konsum         kaffeetee        news
jeanclaudegerber  news             srilanka       tiere            wissen
schule            bellinzona       panorama       news             None
luzern            jrgenklopp       fussball       news             None
panorama          klima            gretathunberg  lissabon         news
australien        vermisstmeldung  gesellschaft   news             None
gesellschaft      amerika          news           ausstellung      None
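If the tags are still a single comma-separated string per article (as in the question), one way to get a frame of this shape could be something like the following; articles['tags'] is just a placeholder name for the scraped tag column:

import pandas as pd

# placeholder for the scraped tag column (one comma-separated string per article)
articles = pd.DataFrame({'tags': [
    'panorama,schweiz,verkehr,news',
    'stans,panorama,verkehr,strassenverkehr,news',
]})

split = articles['tags'].str.split(',', expand=True)  # one tag per column, padded with None
df = split.set_index(0)                               # first tag becomes the index
df.index.name = 'tags'
print(df)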
I would stack the frame into one long Series, group it by tag, and count the values:
# This does all of the above at once
counts = df.stack().rename('category').reset_index('tags').groupby('tags').category.value_counts()
Which gives
counts =

tags              category
australien        gesellschaft     1
                  news             1
                  vermisstmeldung  1
eishockey         davos            1
                  news             1
                  sport            1
gesellschaft      amerika          1
                  ausstellung      1
                  news             1
jeanclaudegerber  news             1
                  srilanka         1
                  tiere            1
                  wissen           1
luzern            fussball         1
                  jrgenklopp       1
                  news             1
panorama          news             2
                  gretathunberg    1
                  klima            1
                  lissabon         1
                  schweiz          1
                  verkehr          1
schule            bellinzona       1
                  news             1
                  panorama         1
sport             eishockey        1
                  news             1
                  schweiz          1
stans             news             1
                  panorama         1
                  strassenverkehr  1
                  verkehr          1
wirtschaft        kaffeetee        1
                  konsum           1
                  news             1
                  schweiz          1
Name: category, dtype: int64
Then you can unstack this series to give a table:
counts.unstack()
category          amerika  ausstellung  bellinzona  davos  eishockey  \
tags
australien        NaN      NaN          NaN         NaN    NaN
eishockey         NaN      NaN          NaN         1.0    NaN
gesellschaft      1.0      1.0          NaN         NaN    NaN
jeanclaudegerber  NaN      NaN          NaN         NaN    NaN
luzern            NaN      NaN          NaN         NaN    NaN
panorama          NaN      NaN          NaN         NaN    NaN
schule            NaN      NaN          1.0         NaN    NaN
sport             NaN      NaN          NaN         NaN    1.0
stans             NaN      NaN          NaN         NaN    NaN
wirtschaft        NaN      NaN          NaN         NaN    NaN
...
Then you can compute correlations on that matrix.
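For the network plot asked about in the question (bigger nodes for frequent tags, thicker lines for frequent pairs), a minimal sketch with networkx and matplotlib could look like the following. It counts the tag pairs straight from the raw comma-separated column rather than from the unstacked table above; tag_series and the sizing factors are only illustrative:

from collections import Counter
from itertools import combinations

import matplotlib.pyplot as plt
import networkx as nx
import pandas as pd

# `tag_series` stands in for the raw column of comma-separated tag strings
tag_series = pd.Series([
    'panorama,schweiz,verkehr,news',
    'sport,schweiz,eishockey,news',
    'eishockey,sport,davos,news',
])

tag_counts = Counter()   # how often each tag appears overall
pair_counts = Counter()  # how often each pair of tags appears in the same article

for row in tag_series.dropna():
    tags = sorted(set(row.split(',')))
    tag_counts.update(tags)
    pair_counts.update(combinations(tags, 2))

G = nx.Graph()
for (a, b), weight in pair_counts.items():
    G.add_edge(a, b, weight=weight)

pos = nx.spring_layout(G, seed=42)
node_sizes = [200 * tag_counts[n] for n in G.nodes()]    # bigger node = more frequent tag
edge_widths = [G[u][v]['weight'] for u, v in G.edges()]  # thicker edge = more co-occurrences

nx.draw_networkx_nodes(G, pos, node_size=node_sizes)
nx.draw_networkx_edges(G, pos, width=edge_widths, alpha=0.5)
nx.draw_networkx_labels(G, pos, font_size=8)
plt.axis('off')
plt.show()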
Upvotes: 2