John Conroy
John Conroy

Reputation: 155

How to sample a scale-free graph

Given a large scale-free graph (a social network graph), what's the best way to sample it such that the sample retains an acceptable abstraction of the properties of the original?

I have a large graph (Munmun's twitter dataset, if you know it). But I need a connected sample of that graph with a reasonably large diameter (tl;dr... reasons why on request... a diameter of 10 would be good).

The problem is that any kinda breadth-first search always is likely to come across some massively connected nodes. So I start such a search, getting the friends of all nodes which I come across. I inevitably come across some massively-connected nodes, and have to get all their friends. This is a problem because I end up with a large number of nodes which are close to each other in the graph. To make programmatic analysis feasible, I have to limit the number of nodes (and edges). The whole point of this exercise is to find shortest paths between nodes, so I'm generally interested in ALL of a node's neighbours. And that's the problem.

One hack around this is to limit the max. number of nodes connected to a user which I'm interested in. For instance, if I come across @barackobama in my breadth-first search, I ensure that I only accept some small proportion of his friends and ignore the rest. But would this hacked graph be worth a damn, or am I losing too much information in terms of finding shortest paths??

Hope that makes sense...

Upvotes: 2

Views: 632

Answers (3)

user3650263
user3650263

Reputation: 1

You might want to check the following: Gscaler: https://github.com/jayCool/Gscaler This is a recent tool which produces synthetic scaled graphs.

It contains the jar file and the related paper for your reference.

Upvotes: 0

Vincent Labatut
Vincent Labatut

Reputation: 1808

Several sampling methods exist, how to choose one depends (amongst other things) of the properties you want to preserve. I found the literature review (section 3) in the thesis Sampling and Inference in Complex Networks [Maiya '11] very informative, for that matter.

But you seem to have found a way of sampling your network, and now you want to find out if the sample is representative of the whole graph in terms of shortest paths. You can try to have a look at this paper: Complex Network Measurements: Estimating the Relevance of Observed Properties [Latapy & Magnien '08]. They describe a method to assess the representativeness of a sample, regarding various classic topological properties. To summarize their approach, they initially have access to the whole studied network, and simulate some sampling process on these data, with increasing sample size. They monitor how properties evolve depending on sample size, and decide of an appropriate size when the properties of interest are stable enough. Their tool is freely available online.

Edit: the only ready-to-use tool I could find online is Albatross. The associated article Albatross Sampling: Robust and Effective Hybrid Vertex Sampling for Social Graphs [Jin et al. '11] also contains a nice review of existing sampling methods, some of which are implemented in the source code they provide.

Edit 2: I needed to use Albatross on a Linux system, so I did a Java port. It's very raw, but it seems to work fine. It's available on GitHub: https://github.com/vlabatut/Albatross

Upvotes: 1

bjoernz
bjoernz

Reputation: 3852

I am not sure, if I understand your question correctly. I think the main question you have is, about how you can compute the shortest path of two nodes in a giant, directed graph. Creating a subsample of the graph seems to be your attempt to create an efficient solution. (But I probably misunderstood you completely.)

Perhaps this SO-Question has some pointers for you: Efficiently finding the shortest path in large graphs

The graphs in that question seem to be significantly smaller, though.

Upvotes: 0

Related Questions