Reputation: 43
I have two datasets with text data: dataset1 (clusters11) and dataset2 (transactns11). I am using a for loop in my code and it takes a long time to process the data, so can anyone suggest how I can reduce the time taken? Please go through the sample datasets and the code below.
dataset1: clusters11
cluster_id clusters
1 A,B,C
2 A,B
3 B,C
4 C,D,E
5 B,C,D
6 D,E,F
7 A,D,F
8 B,G,H
9 B,C,F
10 G,H,M
11 A,H,N
12 B,C,M
dataset2: transactns11
unique_id skills
221 A,B,C
223 A,B
224 B,C
225 C,D,E,F
226 B,C,D,M
227 D,E,F,A
228 A,D,F
229 B,G,H
230 B,C,F,A
231 G,H,M
232 A,H,N
233 B,C,M
234 A,B,C
235 A,B
236 B,C
237 C,D,E
238 B,C,D
239 D,E,F
I want my output, out_df, to look like the table below: whichever clusters in the clusters11 file match 100% with the skills in transactns11 should appear as comma-separated cluster_id values, in a column named matching_values.
unique_id skills matching_values
221 A,B,C 1,2,3
223 A,B 2
224 B,C 3
225 C,D,E,F 4,6
226 B,C,D,M 3,5,12
227 D,E,F,A 6,7
228 A,D,F 7
229 B,G,H 8
230 B,C,F,A 1,2,3,9
231 G,H,M 10
232 A,H,N 11
233 B,C,M 3,12
234 A,B,C 1,2,3
235 A,B 2
236 B,C 3
237 C,D,E 4
238 B,C,D 3,5
239 D,E,F 6
For this I have the following code, and it works fine for me:
out_df <- data.frame(matrix(0, ncol = 3, nrow = nrow(transactns11)))
colnames(out_df) <- c("unique_id", "skills", "matching_values")
transactns11$skills <- as.character(transactns11$skills)
clusters11$clusters <- as.character(clusters11$clusters)
for (n in 1:nrow(transactns11)) {
  trans1 <- strsplit(transactns11$skills[n], ",")
  transvect <- unlist(trans1)
  matching_clusters <- c("")
  out_df$unique_id[n] <- as.character(transactns11$unique_id[n])
  out_df$skills[n] <- transactns11$skills[n]
  for (i in 1:nrow(clusters11)) {
    cluster1 <- strsplit(clusters11$clusters[i], ",")
    clustervect <- unlist(cluster1)
    if (length(intersect(transvect, clustervect)) == length(clustervect)) {
      matching_clusters <- paste(matching_clusters, clusters11$cluster_id[i], sep = ",")
    }
  }
  matching_clusters <- substring(matching_clusters, 2)
  out_df$matching_values[n] <- matching_clusters
}
The problem is that the code takes much longer when the two files have more records, around 10,000 or 20,000 (the time taken is 15 to 20 hours). Can anyone please suggest how I can avoid these loops, or any other method I can follow to improve the code and reduce the huge processing time? Thanks in advance. Regards, hari
Upvotes: 1
Views: 506
Reputation: 39154
This solution uses functions from dplyr, tidyr, and purrr, so I loaded the tidyverse package for all these functions. dt_final is the final output.
# Load package
library(tidyverse)
# Separate the skills and cluster letters
dt1_update <- clusters11 %>%
mutate(Group = strsplit(clusters, split = ",")) %>%
unnest(Group) %>%
select(-clusters)
dt2_update <- transactns11 %>%
mutate(Group = strsplit(skills, split = ",")) %>%
unnest(Group) %>%
select(-skills)
# Split the data frame
dt1_list <- split(dt1_update, f = dt1_update$cluster_id)
dt2_list <- split(dt2_update, f = dt2_update$unique_id)
# Design a function to determine if the Group elements matched
# df1 is data frame from dt1_list, df2 is data frame from dt2_list
is.match <- function(df1, df2){
if (all(df1$Group %in% df2$Group)){
return(TRUE)
} else {
return(FALSE)
}
}
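# For example, is.match(dt1_list[["1"]], dt2_list[["221"]]) returns TRUE because
# every element of cluster 1 (A, B, C) is found in transaction 221's skills.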
# Design a function to return the ID
id.return <- function(df2, df1_list){
answer <- map(df1_list, .f = is.match, df2 = df2)
answer <- names(which(unlist(answer)))
return(paste(answer, collapse = ","))
}
# Design a function to apply all the id.return function to all combination
id.check <- function(df2_list, df1_list){
return(map_chr(df2_list, .f = id.return, df1_list = df1_list))
}
# Apply the id.check function
dt_final <- transactns11 %>%
mutate(matching_values = id.check(dt2_list, dt1_list))
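# dt_final now has unique_id, skills, and matching_values, matching the expected output in the question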
# Create example data frame 1
clusters11 <- read.table(text = "cluster_id clusters
1 A,B,C
2 A,B
3 B,C
4 C,D,E
5 B,C,D
6 D,E,F
7 A,D,F
8 B,G,H
9 B,C,F
10 G,H,M
11 A,H,N
12 B,C,M",
header = TRUE, stringsAsFactors = FALSE)
# Create example data frame 2
transactns11 <- read.table(text = "unique_id skills
221 A,B,C
223 A,B
224 B,C
225 C,D,E,F
226 B,C,D,M
227 D,E,F,A
228 A,D,F
229 B,G,H
230 B,C,F,A
231 G,H,M
232 A,H,N
233 B,C,M
234 A,B,C
235 A,B
236 B,C
237 C,D,E
238 B,C,D
239 D,E,F",
header = TRUE, stringsAsFactors = FALSE)
The above code solves the OP's original question. However, based on the OP's comment, the real datasets have random cluster_id and unique_id values, so I decided to update my answer to generalize this approach.
Basically, a random cluster_id should not be an issue as long as the id.return function returns the right ID. The key to dealing with a random unique_id is to sort the data frame by unique_id before applying the function. Below is the updated code.
This part is the same. However, to simulate the OP's real-world data, I randomized the transactns11 data frame and created a sorting_id to restore the order of the data frame after applying the function.
# Load package
library(tidyverse)
# Create example data frame 1
clusters11 <- read.table(text = "cluster_id clusters
1 A,B,C
2 A,B
3 B,C
4 C,D,E
5 B,C,D
6 D,E,F
7 A,D,F
8 B,G,H
9 B,C,F
10 G,H,M
11 A,H,N
12 B,C,M",
header = TRUE, stringsAsFactors = FALSE)
# Create example data frame 2
transactns11 <- read.table(text = "unique_id skills
221 A,B,C
223 A,B
224 B,C
225 C,D,E,F
226 B,C,D,M
227 D,E,F,A
228 A,D,F
229 B,G,H
230 B,C,F,A
231 G,H,M
232 A,H,N
233 B,C,M
234 A,B,C
235 A,B
236 B,C
237 C,D,E
238 B,C,D
239 D,E,F",
header = TRUE, stringsAsFactors = FALSE)
Based on the OP's update, randomize the rows to simulate the real conditions:
# Set seed for reproducibility
set.seed(123)
transactns11 <- transactns11 %>%
# Randomize the rows
sample_n(size = nrow(.))
Now transactns11 looks like this:
unique_id skills
6 227 D,E,F,A
14 235 A,B
7 228 A,D,F
17 238 B,C,D
15 236 B,C
1 221 A,B,C
16 237 C,D,E
10 231 G,H,M
18 239 D,E,F
5 226 B,C,D,M
8 229 B,G,H
4 225 C,D,E,F
9 230 B,C,F,A
3 224 B,C
13 234 A,B,C
11 232 A,H,N
12 233 B,C,M
2 223 A,B
It is important to create the sorting_id before matching:
transactns11 <- transactns11 %>%
# Create a sorting ID
mutate(sorting_id = 1:n())
transactns11 now looks like this:
unique_id skills sorting_id
1 227 D,E,F,A 1
2 235 A,B 2
3 228 A,D,F 3
4 238 B,C,D 4
5 236 B,C 5
6 221 A,B,C 6
7 237 C,D,E 7
8 231 G,H,M 8
9 239 D,E,F 9
10 226 B,C,D,M 10
11 229 B,G,H 11
12 225 C,D,E,F 12
13 230 B,C,F,A 13
14 224 B,C 14
15 234 A,B,C 15
16 232 A,H,N 16
17 233 B,C,M 17
18 223 A,B 18
Now run the following code. The only important update here is sorting transactns11 by unique_id when creating dt2_update:
# Separate the skills and cluster letters
dt1_update <- clusters11 %>%
mutate(Group = strsplit(clusters, split = ",")) %>%
unnest(Group) %>%
select(-clusters)
dt2_update <- transactns11 %>%
# Sort the data frame by unique_id
arrange(unique_id) %>%
mutate(Group = strsplit(skills, split = ",")) %>%
unnest(Group) %>%
select(-skills)
# Split the data frame
dt1_list <- split(dt1_update, f = dt1_update$cluster_id)
dt2_list <- split(dt2_update, f = dt2_update$unique_id)
# Design a function to determine if the Group elements matched
# df1 is data frame from dt1_list, df2 is data frame from dt2_list
is.match <- function(df1, df2){
if (all(df1$Group %in% df2$Group)){
return(TRUE)
} else {
return(FALSE)
}
}
# Design a function to return the ID
id.return <- function(df2, df1_list){
answer <- map(df1_list, .f = is.match, df2 = df2)
answer <- names(which(unlist(answer)))
return(paste(answer, collapse = ","))
}
# Design a function to apply all the id.return function to all combination
id.check <- function(df2_list, df1_list){
return(map_chr(df2_list, .f = id.return, df1_list = df1_list))
}
Now apply the id.check function. Notice that before doing that, it is necessary to sort the data frame by unique_id. After completing the matching, sort the data frame by sorting_id to restore the original order.
# Apply the id.check function
dt_final <- transactns11 %>%
# Sort the data frame by unique_id
arrange(unique_id) %>%
mutate(matching_values = id.check(dt2_list, dt1_list)) %>%
# Sort the data frame by sorting_id
arrange(sorting_id) %>%
select(-sorting_id)
Here is the final output.
unique_id skills matching_values
1 227 D,E,F,A 6,7
2 235 A,B 2
3 228 A,D,F 7
4 238 B,C,D 3,5
5 236 B,C 3
6 221 A,B,C 1,2,3
7 237 C,D,E 4
8 231 G,H,M 10
9 239 D,E,F 6
10 226 B,C,D,M 3,5,12
11 229 B,G,H 8
12 225 C,D,E,F 4,6
13 230 B,C,F,A 1,2,3,9
14 224 B,C 3
15 234 A,B,C 1,2,3
16 232 A,H,N 11
17 233 B,C,M 3,12
18 223 A,B 2
Upvotes: 0
Reputation: 5956
This new method transforms your data into incidence matrices, which make it easy to compare transactions to clusters.
library(arules)
transactions1 <- read.table(text = "unique_id skills
221 A,B,C
223 A,B
224 B,C
225 C,D,E,F
226 B,C,D,M
227 D,E,F,A
228 A,D,F
229 B,G,H
230 B,C,F,A
231 G,H,M
232 A,H,N
233 B,C,M
234 A,B,C
235 A,B
236 B,C
237 C,D,E
238 B,C,D
239 D,E,F", header=T, stringsAsFactors=F)
clusters <- read.table(text="cluster_id clusters
1 A,B,C
2 A,B
3 B,C
4 C,D,E
5 B,C,D
6 D,E,F
7 A,D,F
8 B,G,H
9 B,C,F
10 G,H,M
11 A,H,N
12 B,C,M", header=T, stringsAsFactors=F)
transactions2 <- sapply(split(transactions1$skills, 1:nrow(transactions1)), strsplit, split = ",")
names(transactions2) <- transactions1$unique_id
clusters2 <- sapply(split(clusters$clusters, 1:nrow(clusters)), strsplit, split=",")
names(clusters2) <- clusters$cluster_id
transactions2 <- +(as(as(transactions2, "transactions"), "matrix"))
clusters2 <- +(as(as(clusters2, "transactions"), "matrix"))
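# The cross-product below counts, for each transaction (row) and cluster (column),
# how many of the cluster's items appear in the transaction; a cluster is a full
# match when that count equals the cluster's size, i.e. rowSums(clusters2).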
coInc <- transactions2 %*% t(clusters2)
coInc <- t(+(t(coInc) == rowSums(clusters2)))
res <- as(coInc, "transactions")
res <- as(res, "list")
res <- as.data.frame(sapply(res, paste, collapse=","))
res$skills <- transactions1$skills[match(rownames(res), transactions1$unique_id)]
res$id <- rownames(res)
colnames(res) <- c("clusters","skills","id")
res <- res[, c(3,2,1)]
# id skills clusters
#221 221 A,B,C 1,2,3
#223 223 A,B 2
#224 224 B,C 3
#225 225 C,D,E,F 4,6
#226 226 B,C,D,M 3,5,12
#227 227 D,E,F,A 6,7
#228 228 A,D,F 7
#229 229 B,G,H 8
#230 230 B,C,F,A 1,2,3,9
#231 231 G,H,M 10
#232 232 A,H,N 11
#233 233 B,C,M 3,12
#234 234 A,B,C 1,2,3
#235 235 A,B 2
#236 236 B,C 3
#237 237 C,D,E 4
#238 238 B,C,D 3,5
#239 239 D,E,F 6
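For comparison, the same incidence-matrix idea can also be sketched in base R without the arules dependency. This is only a minimal sketch reusing the transactions1 and clusters data frames defined above; the helper incidence() and the object names items, trans_mat, clust_mat, shared, and full_match are illustrative choices, not part of the answer above.
# All distinct items appearing in either data set
items <- sort(unique(unlist(strsplit(c(clusters$clusters, transactions1$skills), ","))))
# Logical incidence matrix: one row per input string, one column per item
incidence <- function(x) t(vapply(strsplit(x, ","), function(s) items %in% s,
                                  logical(length(items))))
trans_mat <- incidence(transactions1$skills)
clust_mat <- incidence(clusters$clusters)
# Shared-item counts; a cluster is a full match when the count equals its size
shared <- trans_mat %*% t(clust_mat)
full_match <- shared == matrix(rowSums(clust_mat), nrow(trans_mat),
                               nrow(clust_mat), byrow = TRUE)
transactions1$matching_values <- apply(full_match, 1, function(m)
  paste(clusters$cluster_id[m], collapse = ","))
Because the whole comparison reduces to a single matrix multiplication, this should scale to tens of thousands of rows far better than the nested loop.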
Upvotes: 0
Reputation: 3994
I would use the apply family of functions because it is base R:
clusters11 <- data.frame(cluster_id = seq(1:12),
clusters = c('A,B,C','A,B','B,C','C,D,E','B,C,D','D,E,F','A,D,F',
'B,G,H','B,C,F','G,H,M','A,H,N','B,C,M'))
transactions11 <- data.frame(unique_id = c(221, seq(223,239, by = 1)),
skills = c('A,B,C', 'A,B', 'B,C', 'C,D,E,F', 'B,C,D,M',
'D,E,F,A', 'A,D,F', 'B,G,H', 'B,C,F,A', 'G,H,M',
'A,H,N', 'B,C,M', 'A,B,C', 'A,B', 'B,C', 'C,D,E',
'B,C,D', 'D,E,F'))
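# Which full cluster strings appear verbatim inside each transaction's skills (exploratory; s is not used below)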
s <- apply(sapply(as.character(clusters11$clusters), grepl,
as.character(transactions11$skills)), 1, which)
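# Split each cluster string into a vector of its individual skill letters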
d.list <- sapply(as.character(clusters11$clusters), strsplit, "\\,")
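# For each cluster, flag per transaction whether every one of its letters appears in the skills string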
tf.tab <- lapply(d.list, function(x) apply(sapply(x, function(x)
grepl(x, transactions11$skills)), 1, all, TRUE))
d.matrix <- do.call(cbind, tf.tab)
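# Paste the positions of the fully matching clusters (here equal to cluster_id) into a comma-separated string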
transactions11 <- data.frame(transactions11,
matching_values = apply(d.matrix, 1, function(x) paste(which(x == TRUE),
collapse = ",")))
> transactions11
unique_id skills matching_values
1 221 A,B,C 1,2,3
2 223 A,B 2
3 224 B,C 3
4 225 C,D,E,F 4,6
5 226 B,C,D,M 3,5,12
6 227 D,E,F,A 6,7
7 228 A,D,F 7
8 229 B,G,H 8
9 230 B,C,F,A 1,2,3,9
10 231 G,H,M 10
11 232 A,H,N 11
12 233 B,C,M 3,12
13 234 A,B,C 1,2,3
14 235 A,B 2
15 236 B,C 3
16 237 C,D,E 4
17 238 B,C,D 3,5
18 239 D,E,F 6
This should run much faster than the nested loop. (Note that grepl does substring matching, so this relies on no skill name being a substring of another, which holds for single-letter skills.)
Upvotes: 0
Reputation: 1860
I don't know how fast this would be, but let's try.
First, store the skill letters in codes:
codes<-LETTERS[1:14]
Then make Boolean representations of clusters11$clusters and transactns11$skills:
clusters2 <- sapply(codes, grepl, x = clusters11$clusters)
transactns2 <- sapply(codes, grepl, x = transactns11$skills)
Write a function that tests whether a given cluster is contained in a given transaction:
is_ok<-function(clus, tran) !any(!(tran) & clus)
e.g.
is_ok(clusters2[1,], transactns2[2,])
[1] FALSE
This tells you that cluster 1 is not a match for the second transaction.
Then write a function that checks all the clusters for a given transaction. I use the magrittr package to get access to the %>% operator.
library(magrittr)
matching_values <- function(tran) apply(clusters2, 1, is_ok, tran = tran) %>%
  which %>% paste(collapse = ",")
A quick check:
matching_values(transactns2[5,])
[1] "3,5,12"
Finally, apply the last function to all the transactions:
apply(transactns2,1,matching_values)
[1] "1,2,3" "2" "3" "4,6" "3,5,12" "6,7" "7" "8" "1,2,3,9"
[10] "10" "11" "3,12" "1,2,3" "2" "3" "4" "3,5" "6"
Upvotes: 1