Reputation: 1840
I have this dataset:
test.set <- read.table(text = "
cluster.id letters cat
1 5 A pink
2 4 B blue
3 4 B blue
4 3 A pink
5 3 E pink
6 3 D pink
7 3 C pink
8 2 A blue
9 2 E blue
10 1 A green
11 0 A pink
12 0 E pink", header = TRUE, stringsAsFactors = FALSE)
I'm interested in which letters ended up together in which cluster, while keeping track of the cat they belong to.
Only clusters containing more than one unique letter are relevant (e.g. cluster 4 only contains letter B, and is therefore irrelevant). Let's first keep only the clusters that have at least two distinct letters:
library(dplyr)
x <- test.set %>%
  group_by(cluster.id) %>%
  mutate(letter.count = n_distinct(letters)) %>%
  filter(letter.count > 1) %>%
  ungroup()
cluster.id letters cat letter.count
<int> <chr> <chr> <int>
1 3 A pink 4
2 3 E pink 4
3 3 D pink 4
4 3 C pink 4
5 2 A blue 2
6 2 E blue 2
7 0 A pink 2
8 0 E pink 2
Then I want to build a source/target matrix (or data frame) from this. The resulting network will be undirected, so I only want relations like A -- E (without also listing E -- A). This should result in the following source/target matrix (if I did it correctly by hand ;) ):
source target weight cat
A E 2 pink
A E 1 blue
A D 1 pink
A C 1 pink
E D 1 pink
E C 1 pink
D C 1 pink
I'm sure that there is a library (or a really simple trick) to do this, but I can't figure it out. Any smart/simple way to do this?
Half-pseudocode example:
# for each cluster in test.set:
#   get all unique pairwise combinations of its letters:
combinations <- unique(expand.grid(letters, letters, stringsAsFactors = FALSE)) %>%
  filter(Var1 != Var2)  # removes self-self combinations
# keep only one orientation of each pair (A -- E, but not also E -- A)
combinations <- combinations[!duplicated(apply(combinations, 1, function(x) paste(sort(x), collapse = ""))), ]
#   check whether the letter combination + cat is already in the data frame
#   if not, add it with weight = 1 (e.g. source: A, target: E, weight: 1, cat: pink)
#   else, increase the weight by 1 (e.g. source: A, target: E, weight + 1, cat: pink)
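For reference, the pseudocode above can be sketched in runnable form with base R alone. This is only a sketch under stated assumptions: it replaces the explicit loop with a self-merge on cluster.id and cat, counts each letter pair at most once per cluster, and uses illustrative names (uniq, all.pairs, edges) that are not part of the original code.

```r
test.set <- read.table(text = "
cluster.id letters cat
1 5 A pink
2 4 B blue
3 4 B blue
4 3 A pink
5 3 E pink
6 3 D pink
7 3 C pink
8 2 A blue
9 2 E blue
10 1 A green
11 0 A pink
12 0 E pink", header = TRUE, stringsAsFactors = FALSE)

# One row per distinct letter within each cluster/cat
uniq <- unique(test.set[, c("cluster.id", "letters", "cat")])

# Self-merge: every letter paired with every letter of the same cluster/cat
# (merge() names the two letter columns letters.x and letters.y)
all.pairs <- merge(uniq, uniq, by = c("cluster.id", "cat"))

# Keep one orientation per unordered pair; this also drops self-pairs,
# so clusters with a single letter contribute nothing
all.pairs <- all.pairs[all.pairs$letters.x < all.pairs$letters.y, ]

# Weight = number of clusters in which the pair occurs with that cat
all.pairs$weight <- 1
edges <- aggregate(weight ~ letters.x + letters.y + cat,
                   data = all.pairs, FUN = sum)
names(edges)[1:2] <- c("source", "target")
edges
```

On the sample data this should yield seven edges, with A -- E in pink carrying weight 2. The explicit filter on the number of distinct letters becomes unnecessary here, because a single-letter cluster produces no pairs in the first place.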
Reaction to comments
@Clarinetist
Maybe I wasn't clear but your code contains a bug.
# First data set
txt <- "cluster.id\tletters\tcat
10283\tklebsiella pneumoniae\tprotein1
10463\tescherichia coli\tundefined
10463\tmycobacterium tuberculosis\tundefined
10469\tescherichia coli\tundefined
10469\tmycobacterium tuberculosis\tundefined"
# Second data set (identical, except that id 10469 is changed to 10)
txt <- "cluster.id\tletters\tcat
10283\tklebsiella pneumoniae\tprotein1
10463\tescherichia coli\tundefined
10463\tmycobacterium tuberculosis\tundefined
10\tescherichia coli\tundefined
10\tmycobacterium tuberculosis\tundefined"
The first one works fine:
# A tibble: 1 x 4
source target cat weight
<fctr> <fctr> <chr> <dbl>
1 escherichia coli mycobacterium tuberculosis 1046undefined 2
(although the number 1046 in front of the cat value is still odd)
The second one should produce the same result (I only changed the id from 10469 to 10); however, it produces a wrong result:
# A tibble: 2 x 4
source target cat weight
<fctr> <fctr> <chr> <dbl>
1 escherichia coli mycobacterium tuberculosis 1046undefined 1
2 escherichia coli mycobacterium tuberculosis 1undefined 1
The problem seems to be the digits that end up in front of the values in the cat column.
Upvotes: 0
Views: 78
Reputation: 1187
There's probably a better way to do this.
Here's my attempt:
library(dplyr)
library(gtools)
library(tidyr)
test.set <- read.table(text = "
cluster.id letters cat
1 5 A pink
2 4 B blue
3 4 B blue
4 3 A pink
5 3 E pink
6 3 D pink
7 3 C pink
8 2 A blue
9 2 E blue
10 1 A green
11 0 A pink
12 0 E pink", header = TRUE, stringsAsFactors = FALSE)
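# Keep clusters with more than one distinct letter, then spread them into a
# wide indicator table: one row per "cluster.id cat", one column per letter.
x <- test.set %>%
  group_by(cluster.id) %>%
  mutate(letter.count = n_distinct(letters)) %>%
  filter(letter.count > 1) %>%
  select(-letter.count) %>%
  ungroup() %>%
  mutate(id = paste(cluster.id, cat),
         ind = 1) %>%
  select(-cluster.id, -cat) %>%
  spread(key = letters, value = ind) %>%
  as.data.frame()
row.names(x) <- x$id
x$id <- NULL

for (i in seq_len(nrow(x))){
  cluster <- x[i,]
  # letters present in this cluster (the non-NA columns)
  letters <- names(cluster)[which(!is.na(cluster))]
  # all unordered pairs of those letters
  comb <- combinations(n = length(letters),
                       r = 2,
                       v = letters) %>%
    as.data.frame() %>%
    rename(source = V1,
           target = V2)
  # recover cat from the "cluster.id cat" row name; ". " strips only one
  # character plus the space, so this assumes single-character cluster ids
  cat <- sub(". ", "", dimnames(cluster)[[1]])
  comb$cat <- cat
  rm(cat)
  comb$weight <- 0
  cluster_cols <- dimnames(cluster)[[2]]
  # each pair present in this cluster gets weight 1
  for (j in seq_len(nrow(comb))){
    comb$weight[j] <- min(cluster[cluster_cols == comb$source[j]],
                          cluster[cluster_cols == comb$target[j]]) + comb$weight[j]
  }
  rm(j)
  rm(cluster_cols)
  # append this cluster's pairs to the running result
  if (i == 1){
    final <- comb
    rm(comb)
  } else {
    final <- rbind(final, comb)
    rm(comb)
  }
}
rm(i)

# sum the weights of identical source/target/cat combinations across clusters
final <- final %>%
  group_by(source, target, cat) %>%
  summarize(weight = sum(weight)) %>%
  ungroup()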
> final
# A tibble: 7 x 4
source target cat weight
<fctr> <fctr> <chr> <dbl>
1 A E blue 1
2 A E pink 2
3 A C pink 1
4 A D pink 1
5 C E pink 1
6 C D pink 1
7 D E pink 1
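Some rows above are oriented differently from the hand-made table in the question (C -- E here, E -- C there), which is fine for an undirected network. As a hedged sketch of how the two tables could be compared regardless of orientation, the snippet below re-enters both tables by hand and puts every edge into a canonical order with pmin()/pmax(). The helper name canonical and the variables expected and computed are illustrative and not part of the original code.

```r
# Hand-made table from the question
expected <- read.table(text = "
source target weight cat
A E 2 pink
A E 1 blue
A D 1 pink
A C 1 pink
E D 1 pink
E C 1 pink
D C 1 pink", header = TRUE, stringsAsFactors = FALSE)

# Result printed above
computed <- read.table(text = "
source target cat weight
A E blue 1
A E pink 2
A C pink 1
A D pink 1
C E pink 1
C D pink 1
D E pink 1", header = TRUE, stringsAsFactors = FALSE)

# Put each undirected edge into one orientation and a fixed row order
canonical <- function(df) {
  out <- data.frame(source = pmin(df$source, df$target),
                    target = pmax(df$source, df$target),
                    cat    = df$cat,
                    weight = df$weight,
                    stringsAsFactors = FALSE)
  out <- out[order(out$source, out$target, out$cat), ]
  rownames(out) <- NULL
  out
}

same <- isTRUE(all.equal(canonical(expected), canonical(computed)))
same
```

If same is TRUE, the two tables describe the same weighted, undirected edge list.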
With the updated data set (change cat <- sub(". ", "", dimnames(cluster)[[1]]) to cat <- sub(".* ", "", dimnames(cluster)[[1]]), so that the whole cluster id is stripped rather than a single character):
txt <- "cluster.id\tletters\tcat
10283\tklebsiella pneumoniae\tprotein1
10463\tescherichia coli\tundefined
10463\tmycobacterium tuberculosis\tundefined
10\tescherichia coli\tundefined
10\tmycobacterium tuberculosis\tundefined"
library(dplyr)
library(gtools)
library(tidyr)
library(stringr)
test.set <- read.table(text = txt, header = TRUE, sep = "\t", stringsAsFactors = FALSE) %>%
as.data.frame()
rm(txt)
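# Keep clusters with more than one distinct letter, then spread them into a
# wide indicator table: one row per "cluster.id cat", one column per letter.
x <- test.set %>%
  group_by(cluster.id) %>%
  mutate(letter.count = n_distinct(letters)) %>%
  filter(letter.count > 1) %>%
  select(-letter.count) %>%
  ungroup() %>%
  mutate(id = paste(cluster.id, cat),
         ind = 1) %>%
  select(-cluster.id, -cat) %>%
  spread(key = letters, value = ind) %>%
  as.data.frame()
row.names(x) <- x$id
x$id <- NULL

for (i in seq_len(nrow(x))){
  cluster <- x[i,]
  # letters present in this cluster (the non-NA columns)
  letters <- names(cluster)[which(!is.na(cluster))]
  # all unordered pairs of those letters
  comb <- combinations(n = length(letters),
                       r = 2,
                       v = letters) %>%
    as.data.frame() %>%
    rename(source = V1,
           target = V2)
  # recover cat from the "cluster.id cat" row name; ".* " is greedy and
  # strips everything up to and including the last space
  cat <- sub(".* ", "", dimnames(cluster)[[1]])
  comb$cat <- cat
  rm(cat)
  comb$weight <- 0
  cluster_cols <- dimnames(cluster)[[2]]
  # each pair present in this cluster gets weight 1
  for (j in seq_len(nrow(comb))){
    comb$weight[j] <- min(cluster[cluster_cols == comb$source[j]],
                          cluster[cluster_cols == comb$target[j]]) + comb$weight[j]
  }
  rm(j)
  rm(cluster_cols)
  # append this cluster's pairs to the running result
  if (i == 1){
    final <- comb
    rm(comb)
  } else {
    final <- rbind(final, comb)
    rm(comb)
  }
}
rm(i)

# sum the weights of identical source/target/cat combinations across clusters
final <- final %>%
  group_by(source, target, cat) %>%
  summarize(weight = sum(weight)) %>%
  ungroup()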
Output:
# A tibble: 1 x 4
source target cat weight
<fctr> <fctr> <chr> <dbl>
1 escherichia coli mycobacterium tuberculosis undefined 2
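To illustrate why the pattern matters, here is a small sketch of how the two sub() patterns behave on row names of the form "cluster.id cat". The third pattern and the value "7 light blue" are assumptions added for illustration only (no cat in the data above contains a space); they show a case where the greedy ".* " would remove too much.

```r
ids <- c("3 pink", "10463 undefined", "10 undefined")

# ". " matches exactly one character followed by a space,
# so only the last digit of a multi-digit id is removed
one.char <- sub(". ", "", ids)
one.char
# "pink" "1046undefined" "1undefined"

# ".* " greedily matches everything up to the last space
greedy <- sub(".* ", "", ids)
greedy
# "pink" "undefined" "undefined"

# Hypothetical cat containing a space: the greedy pattern keeps only "blue",
# while anchoring on the first run of non-space characters keeps the full cat
greedy.multi   <- sub(".* ", "", "7 light blue")
anchored.multi <- sub("^[^ ]* ", "", "7 light blue")
greedy.multi
# "blue"
anchored.multi
# "light blue"
```

If cat values may contain spaces, the anchored pattern (or keeping cluster.id and cat as separate columns instead of pasting them together) would probably be the safer choice.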
Upvotes: 1