Veronica

Reputation: 27

R network graphs

I have data where the X column identifies the review and the remaining columns are words that appear in most reviews, with their counts per review. Is it possible to create a graph where the nodes are reviews and the edges are words?

X action age ago amazing american art author back bad beautiful beginning
1   1     1   0    0        0      0    0     0    0     0         0
2   0     0   0    0        0      0    0     0    0     0         0
3   0     0   0    0        0      0    0     0    1     0         0
4   1     3   0    0        0      2    1     0    0     0         0
5   0     0   1    0        1      0    0     2    0     1         0

Another idea is to cluster the reviews in the graph according to the words they use and their frequency.

Thank you very much. Any help is appreciated.

Upvotes: 0

Views: 253

Answers (2)

lukeA

Reputation: 54237

Here are three approaches to explore the relationships in your data:

# df is defined in the Data section below
library(igraph)
library(reshape2)
library(magrittr)  # provides the %>% pipe used in approach 3

par(mfrow = c(1, 3))

# 1. Two-mode network (reviews + words)
set.seed(1)
g <- graph_from_data_frame(subset(melt(df, 1), value > 0, -value)[2:1])
V(g)$type <- bipartite_mapping(g)$type
plot(g, layout = layout_as_bipartite(g)[, 2:1], vertex.color = V(g)$type + 1L)

# 2. Just the reviews: one edge per word that two reviews share
lst <- with(subset(melt(df, 1), value > 0)[2:1], split(X, variable))
lst <- lst[lengths(lst) > 1]                        # words used by at least two reviews
lst <- lapply(lst, function(x) t(combn(x, m = 2)))  # all review pairs per word
g <- graph_from_edgelist(do.call(rbind, lst), directed = FALSE)
E(g)$label <- rep(names(lst), sapply(lst, nrow))
plot(g)

# 3. Review clustering (binary distance on the word counts)
df[-1] %>% dist(method = "binary") %>% hclust %>% plot

Data:

df <- read.table(header = TRUE, text = "
X action age ago amazing american art author back bad beautiful beginning
1   1     1   0    0        0      0    0     0    0     0         0
2   0     0   0    0        0      0    0     0    0     0         0
3   0     0   0    0        0      0    0     0    1     0         0
4   1     3   0    0        0      2    1     0    0     0         0
5   0     0   1    0        1      0    0     2    0     1         0")

PS: There may be a shortcut to no. 2 (reviews as nodes & words as edges) - feel free to add it.
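
One possible shortcut, sketched under the assumption that an edge weight (the number of shared words) is enough and per-word edge labels are not needed: build the two-mode graph and project it onto the reviews with bipartite_projection(). The names el, bg and proj and the "R" prefix on the review ids are illustrative only. Reviews that use none of the words (review 2 here) do not appear in the result.

```r
library(igraph)

df <- read.table(header = TRUE, text = "
X action age ago amazing american art author back bad beautiful beginning
1   1     1   0    0        0      0    0     0    0     0         0
2   0     0   0    0        0      0    0     0    0     0         0
3   0     0   0    0        0      0    0     0    1     0         0
4   1     3   0    0        0      2    1     0    0     0         0
5   0     0   1    0        1      0    0     2    0     1         0")

# One row per (review, word) pair with a non-zero count
idx <- which(as.matrix(df[-1]) > 0, arr.ind = TRUE)
el <- data.frame(review = paste0("R", df$X[idx[, "row"]]),
                 word   = names(df)[-1][idx[, "col"]])

# Two-mode graph: type is TRUE for word vertices, FALSE for review vertices
bg <- graph_from_data_frame(el, directed = FALSE)
V(bg)$type <- V(bg)$name %in% el$word

# Keep only the review side; the edge attribute "weight" counts shared words
proj <- bipartite_projection(bg, which = "false")
plot(proj, edge.label = E(proj)$weight)
```

This trades the word labels of approach 2 for a single weighted edge per pair of reviews.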

Upvotes: 1

S.K.

Reputation: 365

With your example data you could create a graph consisting of nodes (one per review) and edges (two reviews are connected when they use the same word). You could also weight the edges by how many words two reviews have in common, and use different edge shapes or colors to represent the different words.

There are several ways to create a graph from your data. First, you could create an adjacency matrix in which each row and each column represents a review. The adjacency matrix only records whether two reviews have a word in common: the entry is 1 if they share at least one word and 0 otherwise.

The adjacency matrix would look similar to this, where the letters denote the row and column labels:

Review  A  B  C  D
A       0  1  1  1
B       1  0  0  1
C       1  0  0  1
D       1  1  1  0

With the function graph_from_adjacency_matrix() in the igraph package you could then create a graph and use its plot functions.
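
A minimal sketch of this first option, assuming the df from the question (re-created here so the snippet runs on its own). The names m and adj are illustrative, and the constructor is spelled graph_from_adjacency_matrix() in current igraph.

```r
library(igraph)

df <- read.table(header = TRUE, text = "
X action age ago amazing american art author back bad beautiful beginning
1   1     1   0    0        0      0    0     0    0     0         0
2   0     0   0    0        0      0    0     0    0     0         0
3   0     0   0    0        0      0    0     0    1     0         0
4   1     3   0    0        0      2    1     0    0     0         0
5   0     0   1    0        1      0    0     2    0     1         0")

# 0/1 review-by-word matrix: 1 if the review uses the word at all
m <- (as.matrix(df[-1]) > 0) * 1

# tcrossprod(m) is m %*% t(m): shared-word counts per pair of reviews.
# Reduce it to "shares at least one word" (1) or "shares none" (0).
adj <- (tcrossprod(m) > 0) * 1
dimnames(adj) <- list(df$X, df$X)
diag(adj) <- 0  # no self-loops

g <- graph_from_adjacency_matrix(adj, mode = "undirected")
plot(g)
```

In the sample data only reviews 1 and 4 share words, so the plot shows a single edge and three isolated nodes.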

Second, you could create a weight matrix, which counts how many words two reviews share. Using graph_from_adjacency_matrix( , weighted = TRUE) from the igraph package you could create a graph from that matrix. You can find a good introduction to network analysis with the igraph package here: http://kateto.net/networks-r-igraph

Review  A  B  C  D
A       0  2  3  1
B       2  0  0  2
C       3  0  0  2
D       1  2  2  0
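
The weighted variant can be sketched the same way, again assuming the df from the question. Here a word counts once per pair of reviews regardless of how often each review uses it, which is one reading of "how many words are shared"; the names m and shared are illustrative.

```r
library(igraph)

df <- read.table(header = TRUE, text = "
X action age ago amazing american art author back bad beautiful beginning
1   1     1   0    0        0      0    0     0    0     0         0
2   0     0   0    0        0      0    0     0    0     0         0
3   0     0   0    0        0      0    0     0    1     0         0
4   1     3   0    0        0      2    1     0    0     0         0
5   0     0   1    0        1      0    0     2    0     1         0")

# 0/1 review-by-word matrix
m <- (as.matrix(df[-1]) > 0) * 1

# Entry [i, j] = number of distinct words that reviews i and j both use
shared <- tcrossprod(m)
dimnames(shared) <- list(df$X, df$X)
diag(shared) <- 0  # ignore each review's overlap with itself

# weighted = TRUE stores the matrix entries in the edge attribute "weight"
g <- graph_from_adjacency_matrix(shared, mode = "undirected", weighted = TRUE)
plot(g, edge.width = E(g)$weight, edge.label = E(g)$weight)
```

Reviews 1 and 4 share "action" and "age", so their edge carries weight 2.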

Third, you could specify the graph from an edge data frame and a node data frame.

The nodes data frame would contain a short id for each node and, optionally, its name and any other information you want to include about the nodes:

id  long_review_name
R1  A
R2  B
R3  C
R4  D

The edges data frame collects all the information about the connections between two reviews. First and most important, it records every edge in the columns from and to. It could also contain the frequency as a weight on each edge, and type would denote which word the two nodes share:

from  to  weight  type
R1    R2  1       american
R1    R2  1       age
R1    R3  2       american
R1    R3  1       age
R1    R4  1       age
R2    R4  2       american

To turn the edge and node data frames into a graph, use graph_from_data_frame(d = links, vertices = nodes).
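
As a rough sketch, the two data frames could be derived from the question's df like this. The "R" prefix on the ids and the column name review are illustrative choices, and the weight column is left out because the text does not pin down how the frequency should be computed.

```r
library(igraph)

df <- read.table(header = TRUE, text = "
X action age ago amazing american art author back bad beautiful beginning
1   1     1   0    0        0      0    0     0    0     0         0
2   0     0   0    0        0      0    0     0    0     0         0
3   0     0   0    0        0      0    0     0    1     0         0
4   1     3   0    0        0      2    1     0    0     0         0
5   0     0   1    0        1      0    0     2    0     1         0")

# Nodes: one row per review; the first column is used as the vertex name
nodes <- data.frame(id = paste0("R", df$X), review = df$X)

# Edges: for every word, one row per pair of reviews that both use it
links <- do.call(rbind, lapply(names(df)[-1], function(w) {
  ids <- df$X[df[[w]] > 0]
  if (length(ids) < 2) return(NULL)  # a word used by fewer than two reviews links nothing
  pairs <- t(combn(ids, 2))
  data.frame(from = paste0("R", pairs[, 1]),
             to   = paste0("R", pairs[, 2]),
             type = w)
}))

g <- graph_from_data_frame(d = links, vertices = nodes, directed = FALSE)
plot(g, edge.label = E(g)$type)
```

Because each shared word becomes its own edge, reviews 1 and 4 end up connected by two parallel edges labelled "action" and "age".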

Upvotes: 1
