CodeNoob

Reputation: 1840

R: grouped data to undirected source target format for network visualization

I have this dataset:

test.set <- read.table(text = "
                            cluster.id   letters  cat
                       1          5       A       pink
                       2          4       B       blue
                       3          4       B       blue
                       4          3       A       pink
                       5          3       E       pink
                       6          3       D       pink
                       7          3       C       pink
                       8          2       A       blue
                       9          2       E       blue
                       10         1       A       green
                       11         0       A       pink
                       12         0       E       pink", header = TRUE, stringsAsFactors = FALSE)

I'm interested in which letters ended up together in which cluster, while keeping track of the cat they belong to.

Only clusters containing more than one unique letter are relevant (e.g. cluster 4 contains only the letter B and is therefore irrelevant). Let's first keep only the clusters that have at least two distinct letters:

library(dplyr)

x <- test.set %>%
  group_by(cluster.id) %>%
  mutate(letter.count = n_distinct(letters)) %>%
  filter(letter.count > 1) %>%
  ungroup()



  cluster.id letters   cat letter.count
       <int>   <chr> <chr>        <int>
1          3       A  pink            4
2          3       E  pink            4
3          3       D  pink            4
4          3       C  pink            4
5          2       A  blue            2
6          2       E  blue            2
7          0       A  pink            2
8          0       E  pink            2

Then I want to build a source/target matrix or data frame from this. The resulting network will be undirected, so I only want relations like A -- E (without also including E -- A). This should give the following source/target table (if I did it correctly by hand ;)):

source  target  weight  cat
A       E       2       pink
A       E       1       blue
A       D       1       pink
A       C       1       pink
E       D       1       pink
E       C       1       pink
D       C       1       pink

I'm sure that there is a library (or a really simple trick) to do this, but I can't figure it out. Any smart/simple way to do this?


Half pseudocode example:

# for each cluster in test_set
# get all unique pairwise combinations for the letters:
combinations = unique(expand.grid(letters,letters)) %>%
  filter( .$Var1 != .$Var2 ) #removes self,self combinations
combinations = combinations[!duplicated(apply(combinations,1,function(x) paste(sort(x),collapse=''))),]
# check whether the letter combination + cat is already in the data frame
# if not add it with weight = 1 (e.g. source: A, target: E,  weight: 1, cat: pink)
# else increase the weight by 1 (e.g. source: A, target: E, weight + 1, cat: pink)
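
For the pairwise step on its own, here is a minimal sketch comparing the expand.grid approach from the pseudocode with base R's combn(), which returns each unordered pair exactly once. The letter vector `lets` and the "|" separator in the pair keys are chosen here for illustration only.

```r
lets <- c("A", "E", "D", "C")

# Approach from the pseudocode: full grid, drop self-pairs,
# then drop mirrored duplicates such as E/A when A/E is already present.
grid <- expand.grid(lets, lets, stringsAsFactors = FALSE)
grid <- grid[grid$Var1 != grid$Var2, ]
key  <- apply(grid, 1, function(x) paste(sort(x), collapse = "|"))
grid <- grid[!duplicated(key), ]

# combn() on the sorted letters yields each unordered pair once,
# so neither the self-pair filter nor the deduplication is needed.
pairs <- t(combn(sort(lets), 2))
colnames(pairs) <- c("source", "target")

print(pairs)
```

Both routes give the same six pairs. With combn() on sorted input the source always comes alphabetically before the target, so D -- C is reported as C -- D, which is the same edge in an undirected network.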

Reaction to comments
@Clarinetist
Maybe I wasn't clear but your code contains a bug.

txt <- "cluster.id\tletters\tcat
10283\tklebsiella pneumoniae\tprotein1
10463\tescherichia coli\tundefined
10463\tmycobacterium tuberculosis\tundefined
10469\tescherichia coli\tundefined
10469\tmycobacterium tuberculosis\tundefined"

txt <- "cluster.id\tletters\tcat
10283\tklebsiella pneumoniae\tprotein1
10463\tescherichia coli\tundefined
10463\tmycobacterium tuberculosis\tundefined
10\tescherichia coli\tundefined
10\tmycobacterium tuberculosis\tundefined"

The first one works fine:

# A tibble: 1 x 4
            source                     target           cat weight
            <fctr>                     <fctr>         <chr>  <dbl>
1 escherichia coli mycobacterium tuberculosis 1046undefined      2

(although the 1046 prefixed to the cat value is still odd)

The second one should produce the same result (I only changed the id from 10469 to 10), but it produces a wrong result:

# A tibble: 2 x 4
            source                     target           cat weight
            <fctr>                     <fctr>         <chr>  <dbl>
1 escherichia coli mycobacterium tuberculosis 1046undefined      1
2 escherichia coli mycobacterium tuberculosis    1undefined      1

The problem seems to be the digits prepended to the values in the cat column.

Upvotes: 0

Views: 78

Answers (1)

Clarinetist

Reputation: 1187

There's probably a better way to do this.

Here's my attempt:

library(dplyr)
library(gtools)
library(tidyr)

test.set <- read.table(text = "
                            cluster.id   letters  cat
                       1          5       A       pink
                       2          4       B       blue
                       3          4       B       blue
                       4          3       A       pink
                       5          3       E       pink
                       6          3       D       pink
                       7          3       C       pink
                       8          2       A       blue
                       9          2       E       blue
                       10         1       A       green
                       11         0       A       pink
                       12         0       E       pink", header = TRUE, stringsAsFactors = FALSE)

x <- test.set %>% 
  group_by(cluster.id) %>%
  mutate(letter.count = n_distinct(letters)) %>%
  filter(letter.count > 1) %>% 
  select(-letter.count) %>%
  ungroup() %>%
  mutate(id = paste(cluster.id, cat),
         ind = 1) %>%
  select(-cluster.id, -cat) %>%
  spread(key = letters, value = ind) %>%
  as.data.frame()

row.names(x) <- x$id
x$id <- NULL

for (i in 1:nrow(x)){
  cluster <- x[i,]
  letters <- names(cluster)[which(!is.na(cluster))]
  comb <- combinations(n = length(letters),
                       r = 2,
                       v = letters) %>%
    as.data.frame() %>%
    rename(source = V1,
           target = V2)
  cat <- sub(". ", "", dimnames(cluster)[[1]])
  comb$cat <- cat
  rm(cat)
  comb$weight <- 0
  cluster_cols <- dimnames(cluster)[[2]]
  for (j in 1:nrow(comb)){
    comb$weight[j] <- min(cluster[cluster_cols == comb$source[j]],
                          cluster[cluster_cols == comb$target[j]]) + comb$weight[j]
  }
  rm(j)
  rm(cluster_cols)
  if (i == 1){
    final <- comb
    rm(comb)
  } else {
    final <- rbind(final, comb)
    rm(comb)
  }
}

rm(i)

final <- final %>%
  group_by(source, target, cat) %>%
  summarize(weight = sum(weight)) %>%
  ungroup()

> final
# A tibble: 7 x 4
  source target   cat weight
  <fctr> <fctr> <chr>  <dbl>
1      A      E  blue      1
2      A      E  pink      2
3      A      C  pink      1
4      A      D  pink      1
5      C      E  pink      1
6      C      D  pink      1
7      D      E  pink      1
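
A more compact variant is possible. The following is only a sketch in base R, and it assumes that every cluster carries a single cat value (true for the example data, not checked in the code). It splits the data by cluster, builds the unordered letter pairs with combn(), and counts how many clusters contribute each (source, target, cat) triple. The names `pairs_per_cluster` and `edges` are made up for this example.

```r
test.set <- data.frame(
  cluster.id = c(5, 4, 4, 3, 3, 3, 3, 2, 2, 1, 0, 0),
  letters    = c("A", "B", "B", "A", "E", "D", "C", "A", "E", "A", "A", "E"),
  cat        = c("pink", "blue", "blue", "pink", "pink", "pink", "pink",
                 "blue", "blue", "green", "pink", "pink"),
  stringsAsFactors = FALSE
)

# One data frame of pairs per cluster; clusters with fewer than
# two distinct letters return NULL and are dropped afterwards.
pairs_per_cluster <- lapply(split(test.set, test.set$cluster.id), function(d) {
  u <- sort(unique(d$letters))
  if (length(u) < 2) return(NULL)
  m <- combn(u, 2)  # one unordered pair per column
  data.frame(source = m[1, ], target = m[2, ],
             cat = d$cat[1],  # assumption: one cat per cluster
             stringsAsFactors = FALSE)
})
pairs_per_cluster <- Filter(Negate(is.null), pairs_per_cluster)

edges <- do.call(rbind, pairs_per_cluster)

# Each row stands for one cluster containing the pair, so summing gives the weight.
edges$weight <- 1
edges <- aggregate(weight ~ source + target + cat, data = edges, FUN = sum)
print(edges)
```

On the example data this should give the same seven triples as the loop above, with A -- E in pink carrying weight 2. The row order may differ.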

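For the updated data set, change cat <- sub(". ", "", dimnames(cluster)[[1]]) to cat <- sub(".* ", "", dimnames(cluster)[[1]]), so that the whole cluster id is stripped from the row name rather than only the single character before the space: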

txt <- "cluster.id\tletters\tcat
10283\tklebsiella pneumoniae\tprotein1
10463\tescherichia coli\tundefined
10463\tmycobacterium tuberculosis\tundefined
10\tescherichia coli\tundefined
10\tmycobacterium tuberculosis\tundefined"

library(dplyr)
library(gtools)
library(tidyr)

test.set <- read.table(text = txt, header = TRUE, sep = "\t", stringsAsFactors = FALSE) %>%
  as.data.frame()
rm(txt)

x <- test.set %>% 
  group_by(cluster.id) %>%
  mutate(letter.count = n_distinct(letters)) %>%
  filter(letter.count > 1) %>% 
  select(-letter.count) %>%
  ungroup() %>%
  mutate(id = paste(cluster.id, cat),
         ind = 1) %>%
  select(-cluster.id, -cat) %>%
  spread(key = letters, value = ind) %>%
  as.data.frame()

row.names(x) <- x$id
x$id <- NULL

for (i in 1:nrow(x)){
  cluster <- x[i,]
  letters <- names(cluster)[which(!is.na(cluster))]
  comb <- combinations(n = length(letters),
                       r = 2,
                       v = letters) %>%
    as.data.frame() %>%
    rename(source = V1,
           target = V2)
  cat <- sub(".* ", "", dimnames(cluster)[[1]])
  comb$cat <- cat
  rm(cat)
  comb$weight <- 0
  cluster_cols <- dimnames(cluster)[[2]]
  for (j in 1:nrow(comb)){
    comb$weight[j] <- min(cluster[cluster_cols == comb$source[j]],
                          cluster[cluster_cols == comb$target[j]]) + comb$weight[j]
  }
  rm(j)
  rm(cluster_cols)
  if (i == 1){
    final <- comb
    rm(comb)
  } else {
    final <- rbind(final, comb)
    rm(comb)
  }
}

rm(i)

final <- final %>%
  group_by(source, target, cat) %>%
  summarize(weight = sum(weight)) %>%
  ungroup()

Output:

# A tibble: 1 x 4
            source                     target       cat weight
            <fctr>                     <fctr>     <chr>  <dbl>
1 escherichia coli mycobacterium tuberculosis undefined      2
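
Why the one-character pattern breaks can be seen by running the substitutions on the row ids in isolation. This sketch assumes the ids look like those built by paste(cluster.id, cat) above. The anchored pattern and the "7 light blue" id are illustrative additions, not part of the original code.

```r
# Row ids as built by paste(cluster.id, cat).
ids <- c("3 pink", "10463 undefined", "10 undefined")

# "." matches exactly one character, so only the last digit
# before the space is removed and the leading digits stay behind.
one_char <- sub(". ", "", ids)
print(one_char)   # "pink" "1046undefined" "1undefined"

# ".* " is greedy and removes everything up to the last space.
greedy <- sub(".* ", "", ids)
print(greedy)     # "pink" "undefined" "undefined"

# Greedy matching would also eat part of a cat value that contains a space.
# Anchoring on the leading run of non-space characters avoids that.
with_space <- "7 light blue"
print(sub(".* ", "", with_space))      # "blue"
anchored <- sub("^[^ ]+ ", "", c(ids, with_space))
print(anchored)   # "pink" "undefined" "undefined" "light blue"
```

With single-digit cluster ids, as in the first data set, the one-character pattern happens to remove the whole id, which is why the problem only showed up with the longer ids.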

Upvotes: 1
