Reputation: 3294
For my term paper in the CAS Basic Data Science semester I'm scraping a news website and collecting all articles with their metadata (author, title, subtitle, summary, tags, category, subcategory, creation dt, update dt, etc.), inspired by this video: https://www.youtube.com/watch?v=-YpwsdRKt8Q
Everything works quite well; my Raspberry Pi has been collecting this data every 15 minutes.
I just have one problem: I'd like to create a correlation network out of the tags. The tag column looks like this:
0     panorama,schweiz,verkehr,news
1     sport,schweiz,eishockey,news
2     stans,panorama,verkehr,strassenverkehr,news
3     eishockey,sport,davos,news
4     wirtschaft,schweiz,konsum,kaffeetee,news
5     jeanclaudegerber,news,srilanka,tiere,wissen
6     schule,bellinzona,panorama,news
7     luzern,jrgenklopp,fussball,news
8     panorama,klima,gretathunberg,lissabon,news
9     australien,vermisstmeldung,gesellschaft,news
10    gesellschaft,amerika,news,ausstellung
Now I want to calculate the correlations between the tags. For example, in the first row "panorama" has one link to each of "schweiz", "verkehr" and "news"; "schweiz" has one link to each of "panorama", "verkehr" and "news"; and so on. Sometimes a row has 3 tags, sometimes up to 7 or 8.
I'd like a script that runs through all the rows, calculates these connections and sums them up.
First question: could someone give me a hint on how to do this? Are there modules that could help? I would be grateful even for a small hint.
And the last question: could someone also give me a hint on how to visualize this? I'd like a network plot where I can see the whole map, with the most common tags drawn bigger and the most common connections drawn with thicker lines.
My main problem is that I don't even know what to search for. You probably noticed that English is not my native language, and in German I haven't found anything that really helped me ;-)
Thanks a lot and cheers from Switzerland, marco
Edit/PS: to be more precise, every entry in the list is a tag. So if I have:
panorama,schweiz,verkehr,news
These are 4 tags and everyone of it is related to the other three ones.
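To make clear what kind of counting I mean, here is a rough sketch of the idea (the rows list below is just a couple of example rows from the column above):

from collections import Counter
from itertools import combinations

# a couple of example rows from the tag column above
rows = [
    'panorama,schweiz,verkehr,news',
    'sport,schweiz,eishockey,news',
]

pair_counts = Counter()
for row in rows:
    # every tag in a row gets one link to every other tag in the same row
    for pair in combinations(sorted(row.split(',')), 2):
        pair_counts[pair] += 1

print(pair_counts.most_common())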
Upvotes: 0
Views: 151
Reputation: 1843
I think the first thing you'll want to do is count the occurrence of each category for each tag, so start with a Pandas DataFrame with the tags as the index:
df =

                  1                2              3                4
tags
panorama          schweiz          verkehr        news             None
sport             schweiz          eishockey      news             None
stans             panorama         verkehr        strassenverkehr  news
eishockey         sport            davos          news             None
wirtschaft        schweiz          konsum         kaffeetee        news
jeanclaudegerber  news             srilanka       tiere            wissen
schule            bellinzona       panorama       news             None
luzern            jrgenklopp       fussball       news             None
panorama          klima            gretathunberg  lissabon         news
australien        vermisstmeldung  gesellschaft   news             None
gesellschaft      amerika          news           ausstellung      None
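If the tags are still a single comma-separated string per article (as in the question), one way to get a frame of this shape could be something like the following; articles['tags'] is just a placeholder name for the scraped tag column:

import pandas as pd

# placeholder for the scraped tag column (one comma-separated string per article)
articles = pd.DataFrame({'tags': [
    'panorama,schweiz,verkehr,news',
    'stans,panorama,verkehr,strassenverkehr,news',
]})

split = articles['tags'].str.split(',', expand=True)  # one tag per column, padded with None
df = split.set_index(0)                               # first tag becomes the index
df.index.name = 'tags'
print(df)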
I would stack the frame into one long Series, group it by tag, and count the values:
# This does all of the above at once
counts = df.stack().rename('category').reset_index('tags').groupby('tags').category.value_counts()
Which gives
counts =

tags              category
australien        gesellschaft     1
                  news             1
                  vermisstmeldung  1
eishockey         davos            1
                  news             1
                  sport            1
gesellschaft      amerika          1
                  ausstellung      1
                  news             1
jeanclaudegerber  news             1
                  srilanka         1
                  tiere            1
                  wissen           1
luzern            fussball         1
                  jrgenklopp       1
                  news             1
panorama          news             2
                  gretathunberg    1
                  klima            1
                  lissabon         1
                  schweiz          1
                  verkehr          1
schule            bellinzona       1
                  news             1
                  panorama         1
sport             eishockey        1
                  news             1
                  schweiz          1
stans             news             1
                  panorama         1
                  strassenverkehr  1
                  verkehr          1
wirtschaft        kaffeetee        1
                  konsum           1
                  news             1
                  schweiz          1
Name: category, dtype: int64
Then you can unstack this series to give a table:
counts.unstack()
category          amerika  ausstellung  bellinzona  davos  eishockey  \
tags
australien        NaN      NaN          NaN         NaN    NaN
eishockey         NaN      NaN          NaN         1.0    NaN
gesellschaft      1.0      1.0          NaN         NaN    NaN
jeanclaudegerber  NaN      NaN          NaN         NaN    NaN
luzern            NaN      NaN          NaN         NaN    NaN
panorama          NaN      NaN          NaN         NaN    NaN
schule            NaN      NaN          1.0         NaN    NaN
sport             NaN      NaN          NaN         NaN    1.0
stans             NaN      NaN          NaN         NaN    NaN
wirtschaft        NaN      NaN          NaN         NaN    NaN
...
Then you can compute correlations on that matrix.
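For the network plot asked about in the question (bigger nodes for frequent tags, thicker lines for frequent pairs), a minimal sketch with networkx and matplotlib could look like the following. It counts the tag pairs straight from the raw comma-separated column rather than from the unstacked table above; tag_series and the sizing factors are only illustrative:

from collections import Counter
from itertools import combinations

import matplotlib.pyplot as plt
import networkx as nx
import pandas as pd

# `tag_series` stands in for the raw column of comma-separated tag strings
tag_series = pd.Series([
    'panorama,schweiz,verkehr,news',
    'sport,schweiz,eishockey,news',
    'eishockey,sport,davos,news',
])

tag_counts = Counter()   # how often each tag appears overall
pair_counts = Counter()  # how often each pair of tags appears in the same article

for row in tag_series.dropna():
    tags = sorted(set(row.split(',')))
    tag_counts.update(tags)
    pair_counts.update(combinations(tags, 2))

G = nx.Graph()
for (a, b), weight in pair_counts.items():
    G.add_edge(a, b, weight=weight)

pos = nx.spring_layout(G, seed=42)
node_sizes = [200 * tag_counts[n] for n in G.nodes()]    # bigger node = more frequent tag
edge_widths = [G[u][v]['weight'] for u, v in G.edges()]  # thicker edge = more co-occurrences

nx.draw_networkx_nodes(G, pos, node_size=node_sizes)
nx.draw_networkx_edges(G, pos, width=edge_widths, alpha=0.5)
nx.draw_networkx_labels(G, pos, font_size=8)
plt.axis('off')
plt.show()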
Upvotes: 2