Tal Galili

Reputation: 25336

What's the best way to map the link connections between blogs?

I wish to perform a social network analysis on a bunch of blogs, plotting who links to whom (not just in their blogrolls but also inside their posts). What software can perform this kind of crawling, data collection, and mapping?

Thanks!

Upvotes: 0

Views: 281

Answers (4)

Matt Luongo

Reputation: 14849

For the record, I highly recommend the mechanize library in Python; it makes building your own personalized crawler/scraper a snap.
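As a rough sketch of the link-extraction step that such a crawler needs, here is a version that uses only Python's standard library (`html.parser` and `urllib.parse`) rather than mechanize. The names `LinkCollector` and `outbound_hosts`, and the `example` URLs, are made up for illustration. Fetching the pages is left out.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag in an HTML document."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, value))


def outbound_hosts(page_url, html):
    """Return the hosts a page links to, excluding its own host."""
    parser = LinkCollector(page_url)
    parser.feed(html)
    own_host = urlparse(page_url).netloc
    hosts = []
    for link in parser.links:
        host = urlparse(link).netloc
        # Skip links without a host (e.g. mailto:) and links to the same blog.
        if host and host != own_host:
            hosts.append(host)
    return hosts


# Hypothetical page content, standing in for a downloaded blog post.
sample_html = """
<p>See <a href="http://blog-b.example/post/1">this post</a>,
<a href="/about">about me</a> and
<a href="http://blog-c.example/">another blog</a>.</p>
"""
print(outbound_hosts("http://blog-a.example/2010/01/hello", sample_html))
```

Running this over every post of every blog would give the (source, target) pairs that the graph is built from. Because it looks at all anchors in the page body, it picks up links inside posts as well as in the blogroll.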

Upvotes: 0

doug

Reputation: 70068

I'm not sure whether by "mapping" you mean mapping raw data to a conventional graph data structure, or mapping that data structure to a plotting library in order to render it. If the former, I would guess it's a straightforward matter of writing a function that translates the raw data (which blogs link to which, and how often) into a graph data structure such as an adjacency matrix. If the latter, such a data structure can be rendered for viewing in R with Rgraphviz like this:

library(Rgraphviz)
# create a synthetic 10 x 10 adjacency matrix for 10 blogs (1 = link, 0 = no link)
M = sapply(rep(10, 10), function(x){sample(c(0, 1), 10, TRUE, c(0.7, 0.3))})
colnames(M) = paste(rep("b", 10), 1:10, sep="-")
rownames(M) = colnames(M) 
# 0's down the main diagonal (eliminate self-edges)
diag(M) = rep(0, 10)
# call the graphviz constructor, passing in adjacency matrix
M_gr = new("graphAM", adjMat=M, edgemode="directed")
g1 = layoutGraph(M_gr)
# (optional) aesthetic parameters for nodes & edges
graph.par( list(edges = list(col="gray", lty="dashed", lwd=1), 
            nodes = list( col="midnightblue", shape="ellipse", 
               textCol="darkred", fill="#B0B7C6", fontsize=11, 
               lty="dotted", lwd=2)) )
# call the device driver
png(file='somefilename.png', width=600, height=460, res=128)
# call the plot function
renderGraph(g1)
# kill the device
dev.off()

(Image: rendered graph, http://img13.imageshack.us/img13/7683/bloggraph.png)

If you want to show not just connections but the strength of those connections (e.g., the number or frequency of links from one blog to another), you can do that by setting the line thickness of each edge individually through the parameter 'lwd', which I've set to 1 for all edges in this example. Another option is to show connection strength by line type (dotted, dashed, solid) or by color. Of course, these edge weights will have to be set in your adjacency matrix, which is simple enough: instead of 0/1 to represent not connected/connected, you'll probably want to use 0 for no link and a positive integer for the number of links.
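To make the 0/integer idea concrete, here is a minimal sketch, in Python rather than R, of turning a list of (source, target) link pairs into a weighted adjacency matrix. The function name `weighted_adjacency` and the sample pairs are hypothetical. Like the `diag(M)` step above, it drops self-links.

```python
from collections import Counter


def weighted_adjacency(links):
    """Build a weighted adjacency matrix from (source, target) link pairs.

    Entry [i][j] counts links from blog i to blog j; self-links are dropped.
    """
    counts = Counter((src, dst) for src, dst in links if src != dst)
    # Row/column order: blog names sorted alphabetically.
    blogs = sorted({name for pair in counts for name in pair})
    index = {name: i for i, name in enumerate(blogs)}
    matrix = [[0] * len(blogs) for _ in blogs]
    for (src, dst), n in counts.items():
        matrix[index[src]][index[dst]] = n
    return blogs, matrix


# Hypothetical crawl result: b-1 links to b-2 twice, b-2 links to b-3 once.
links = [("b-1", "b-2"), ("b-1", "b-2"), ("b-2", "b-3"), ("b-3", "b-3")]
blogs, matrix = weighted_adjacency(links)
print(blogs)
for row in matrix:
    print(row)
```

The resulting counts could then be exported (for example as CSV) and loaded into R as the adjacency matrix, with each count mapped to an edge's line width.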

Upvotes: 3

Shane

Reputation: 100194

You could also do this in R with a combination of something like RCurl or XML (to get the blog posts) and something like igraph (for the SNA). You will need to parse the HTML to get all the links, and the XML package can handle that kind of processing very easily.

Have a look at this related question for some pointers on SNA, although it is a big field of study.

Upvotes: 2

Paul Tomblin

Reputation: 182802

Nutch is a decent enough crawler, but you'd have to do your own analysis on the indexed data.

Upvotes: 1
