SHP

Reputation: 43

Alternative way to code without loops in R and reduce looping time

I have 2 datasets with text data: dataset1 (clusters11) and dataset2 (transactns11). I am using a for loop in my code and it takes a long time to process the data, so can anyone suggest how I can reduce the time taken? Please go through the sample datasets below and the code I am using.

dataset1: clusters11
cluster_id  clusters
1   A,B,C
2   A,B
3   B,C
4   C,D,E
5   B,C,D
6   D,E,F
7   A,D,F
8   B,G,H
9   B,C,F
10  G,H,M
11  A,H,N
12  B,C,M


dataset2: transactns11
unique_id   skills
221 A,B,C
223 A,B
224 B,C
225 C,D,E,F
226 B,C,D,M
227 D,E,F,A
228 A,D,F
229 B,G,H
230 B,C,F,A
231 G,H,M
232 A,H,N
233 B,C,M
234 A,B,C
235 A,B
236 B,C
237 C,D,E
238 B,C,D
239 D,E,F

I want my output, out_df, to look like the table below: whichever clusters in the clusters11 file match 100% with the skills in transactns11 should appear as comma-separated cluster_id values in a column named matching_values. (A short sketch of the matching test follows the table.)

unique_id   skills  matching_values
221 A,B,C   1,2,3
223 A,B 2
224 B,C 3
225 C,D,E,F 4,6
226 B,C,D,M 3,5,12
227 D,E,F,A 6,7
228 A,D,F   7
229 B,G,H   8
230 B,C,F,A 1,2,3,9
231 G,H,M   10
232 A,H,N   11
233 B,C,M   3,12
234 A,B,C   1,2,3
235 A,B 2
236 B,C 3
237 C,D,E   4
238 B,C,D   3,5
239 D,E,F   6
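
In other words, a cluster matches a transaction when every element of the cluster appears in the transaction's skills. A minimal check for a single pair, using values from the tables above, looks like this (illustration only, not part of my code below):

# Does cluster 3 ("B,C") match transaction 226 ("B,C,D,M")?
cluster <- strsplit("B,C", ",")[[1]]
skills  <- strsplit("B,C,D,M", ",")[[1]]
all(cluster %in% skills)
# [1] TRUE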

For this I have the following code, and it works fine for me:

  # Pre-allocate the output data frame
  out_df <- data.frame(matrix(0, ncol = 3, nrow = nrow(transactns11)))
  colnames(out_df) <- c("unique_id", "skills", "matching_values")

  # Make sure the text columns are character, not factor
  transactns11$skills <- as.character(transactns11$skills)
  clusters11$clusters <- as.character(clusters11$clusters)

  for(n in 1:nrow(transactns11)) {

    # Split the current transaction's skills into a character vector
    trans1 <- strsplit(transactns11$skills[n], ",")
    transvect <- unlist(trans1)

    matching_clusters <- c("")
    out_df$unique_id[n] <- as.character(transactns11$unique_id[n])
    out_df$skills[n] <- transactns11$skills[n]

    # Compare the transaction against every cluster
    for(i in 1:nrow(clusters11)) {

      cluster1 <- strsplit(clusters11$clusters[i], ",")
      clustervect <- unlist(cluster1)

      # A cluster matches when all of its elements appear in the transaction
      if(length(intersect(transvect, clustervect)) == length(clustervect)) {
        matching_clusters <- paste(matching_clusters, clusters11$cluster_id[i], sep = ",")
      }
    }
    # Drop the leading comma
    matching_clusters <- substring(matching_clusters, 2)
    out_df$matching_values[n] <- matching_clusters
  }

The problem is that the code takes much longer when the two files have more records, say 10,000 or 20,000 (15 to 20 hours). Can anyone please suggest how I can avoid these loops, or any other method I can follow to improve the code and reduce the huge processing time? Thanks in advance. Regards, hari

Upvotes: 1

Views: 506

Answers (4)

www

Reputation: 39154

This solution uses functions from dplyr, tidyr, and purrr, so I load the tidyverse package to get all of them. dt_final is the final output.

# Load package
library(tidyverse)

# Separate the skills and cluster letters
dt1_update <- clusters11 %>%
  mutate(Group = strsplit(clusters, split = ",")) %>%
  unnest(Group) %>%
  select(-clusters)

dt2_update <- transactns11 %>%
  mutate(Group = strsplit(skills, split = ",")) %>%
  unnest(Group) %>%
  select(-skills)

# Split the data frame
dt1_list <- split(dt1_update, f = dt1_update$cluster_id)
dt2_list <- split(dt2_update, f = dt2_update$unique_id)

# Design a function to determine if the Group elements matched

# df1 is data frame from dt1_list, df2 is data frame from dt2_list
is.match <- function(df1, df2){

  if (all(df1$Group %in% df2$Group)){
    return(TRUE)
  } else {
    return(FALSE)
  }
}
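
For example, using the lists built above (an extra check for illustration, not required by the solution):

# Cluster 3 is B,C; transaction 226 is B,C,D,M, so the cluster is fully contained
is.match(dt1_list[["3"]], dt2_list[["226"]])
# [1] TRUE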

# Design a function to return the ID
id.return <- function(df2, df1_list){
  answer <- map(df1_list, .f = is.match, df2 = df2)
  answer <- names(which(unlist(answer)))
  return(paste(answer, collapse = ","))
}

# Design a function to apply all the id.return function to all combination
id.check <- function(df2_list, df1_list){
  return(map_chr(df2_list, .f = id.return, df1_list = df1_list))
}

# Apply the id.check function
dt_final <- transactns11 %>%
  mutate(matching_values = id.check(dt2_list, dt1_list))

Data

# Create example data frame 1
clusters11 <- read.table(text = "cluster_id  clusters
1   A,B,C
                 2   A,B
                 3   B,C
                 4   C,D,E
                 5   B,C,D
                 6   D,E,F
                 7   A,D,F
                 8   B,G,H
                 9   B,C,F
                 10  G,H,M
                 11  A,H,N
                 12  B,C,M", 
                 header = TRUE, stringsAsFactors = FALSE)

# Create example data frame 2
transactns11 <- read.table(text = "unique_id   skills
221 A,B,C
                  223 A,B
                  224 B,C
                  225 C,D,E,F
                  226 B,C,D,M
                  227 D,E,F,A
                  228 A,D,F
                  229 B,G,H
                  230 B,C,F,A
                  231 G,H,M
                  232 A,H,N
                  233 B,C,M
                  234 A,B,C
                  235 A,B
                  236 B,C
                  237 C,D,E
                  238 B,C,D
                  239 D,E,F",
                  header = TRUE, stringsAsFactors = FALSE)

Update

The above code solves the OP's original question. However, based on the OP's comment, the real datasets have non-sequential (random) cluster_id and unique_id values, so I have updated my answer to generalize this approach.

Basically, the random cluster_id should not be an issue as long as the id.return function returns the right ID. The key to dealing with a random unique_id is to sort the data frame by unique_id before applying the function, because split() orders its output by the sorted split variable rather than by row order. A short illustration of that ordering and the updated code follow.
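
Here is a quick illustration of that ordering, using a hypothetical three-row data frame (not part of the solution itself):

# split() returns its pieces ordered by the sorted levels of the splitting
# variable, so the positional mutate() below only lines up with a data frame
# that is itself sorted by unique_id
x <- data.frame(unique_id = c(230, 221, 224), skills = c("B,C,F,A", "A,B,C", "B,C"))
names(split(x, f = x$unique_id))
# [1] "221" "224" "230"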

Data Preparation

This part is the same. However, to simulate the OP's real-world data, I randomize the transactns11 data frame and create a sorting_id to restore the order of the data frame after applying the function.

# Load package
library(tidyverse)

# Create example data frame 1
clusters11 <- read.table(text = "cluster_id  clusters
                         1   A,B,C
                         2   A,B
                         3   B,C
                         4   C,D,E
                         5   B,C,D
                         6   D,E,F
                         7   A,D,F
                         8   B,G,H
                         9   B,C,F
                         10  G,H,M
                         11  A,H,N
                         12  B,C,M", 
                 header = TRUE, stringsAsFactors = FALSE)

# Create example data frame 2
transactns11 <- read.table(text = "unique_id   skills
                           221 A,B,C
                           223 A,B
                           224 B,C
                           225 C,D,E,F
                           226 B,C,D,M
                           227 D,E,F,A
                           228 A,D,F
                           229 B,G,H
                           230 B,C,F,A
                           231 G,H,M
                           232 A,H,N
                           233 B,C,M
                           234 A,B,C
                           235 A,B
                           236 B,C
                           237 C,D,E
                           238 B,C,D
                           239 D,E,F",
                  header = TRUE, stringsAsFactors = FALSE)

Based on the OP's update, randomize the rows to simulate the real conditions:

# Set seed for reproducibility
set.seed(123)

transactns11 <- transactns11 %>%
  # Randomize the rows
  sample_n(size = nrow(.))

Now transactns11 looks like this.

   unique_id  skills
6        227 D,E,F,A
14       235     A,B
7        228   A,D,F
17       238   B,C,D
15       236     B,C
1        221   A,B,C
16       237   C,D,E
10       231   G,H,M
18       239   D,E,F
5        226 B,C,D,M
8        229   B,G,H
4        225 C,D,E,F
9        230 B,C,F,A
3        224     B,C
13       234   A,B,C
11       232   A,H,N
12       233   B,C,M
2        223     A,B

It is important to create the sorting_id:

transactns11 <- transactns11 %>%
  # Create a sorting ID
  mutate(sorting_id = 1:n())

transactns11 now looks like this:

   unique_id  skills sorting_id
1        227 D,E,F,A          1
2        235     A,B          2
3        228   A,D,F          3
4        238   B,C,D          4
5        236     B,C          5
6        221   A,B,C          6
7        237   C,D,E          7
8        231   G,H,M          8
9        239   D,E,F          9
10       226 B,C,D,M         10
11       229   B,G,H         11
12       225 C,D,E,F         12
13       230 B,C,F,A         13
14       224     B,C         14
15       234   A,B,C         15
16       232   A,H,N         16
17       233   B,C,M         17
18       223     A,B         18

Find the matching ID

Now run the following code. The only important update here is to sort transactns11 by unique_id when creating dt2_update.

# Separate the skills and cluster letters
dt1_update <- clusters11 %>%
  mutate(Group = strsplit(clusters, split = ",")) %>%
  unnest(Group) %>%
  select(-clusters)

dt2_update <- transactns11  %>%
  # Sort the data frame by unique_id
  arrange(unique_id) %>%
  mutate(Group = strsplit(skills, split = ",")) %>%
  unnest(Group) %>%
  select(-skills)

# Split the data frame
dt1_list <- split(dt1_update, f = dt1_update$cluster_id)
dt2_list <- split(dt2_update, f = dt2_update$unique_id)

# Design a function to determine if the Group elements matched

# df1 is data frame from dt1_list, df2 is data frame from dt2_list
is.match <- function(df1, df2){

  if (all(df1$Group %in% df2$Group)){
    return(TRUE)
  } else {
    return(FALSE)
  }
}

# Design a function to return the ID
id.return <- function(df2, df1_list){
  answer <- map(df1_list, .f = is.match, df2 = df2)
  answer <- names(which(unlist(answer)))
  return(paste(answer, collapse = ","))
}

# Design a function to apply all the id.return function to all combination
id.check <- function(df2_list, df1_list){
  return(map_chr(df2_list, .f = id.return, df1_list = df1_list))
}

Now apply the id.check function. Notice that before doing that, it is necessary to sort the data frame by unique_id. After completing the matching, sort the data frame by sorting_id to restore the original order.

# Apply the id.check function
dt_final <- transactns11 %>%
  # Sort the data frame by unique_id
  arrange(unique_id) %>%
  mutate(matching_values = id.check(dt2_list, dt1_list)) %>%
  # Sort the data frame by sorting_id
  arrange(sorting_id) %>%
  select(-sorting_id)

Here is the final output.

   unique_id  skills matching_values
1        227 D,E,F,A             6,7
2        235     A,B               2
3        228   A,D,F               7
4        238   B,C,D             3,5
5        236     B,C               3
6        221   A,B,C           1,2,3
7        237   C,D,E               4
8        231   G,H,M              10
9        239   D,E,F               6
10       226 B,C,D,M          3,5,12
11       229   B,G,H               8
12       225 C,D,E,F             4,6
13       230 B,C,F,A         1,2,3,9
14       224     B,C               3
15       234   A,B,C           1,2,3
16       232   A,H,N              11
17       233   B,C,M            3,12
18       223     A,B               2

Upvotes: 0

emilliman5

Reputation: 5956

New solution with transactions and matrices:

This new method transforms your data into incidence matrices, which make it easy to compare transactions to clusters.

library(arules)

transactions1 <- read.table(text = "unique_id   skills
                            221 A,B,C
                            223 A,B
                            224 B,C
                            225 C,D,E,F
                            226 B,C,D,M
                            227 D,E,F,A
                            228 A,D,F
                            229 B,G,H
                            230 B,C,F,A
                            231 G,H,M
                            232 A,H,N
                            233 B,C,M
                            234 A,B,C
                            235 A,B
                            236 B,C
                            237 C,D,E
                            238 B,C,D
                            239 D,E,F", header=T, stringsAsFactors=F)

clusters <- read.table(text="cluster_id  clusters
                       1   A,B,C
                       2   A,B
                       3   B,C
                       4   C,D,E
                       5   B,C,D
                       6   D,E,F
                       7   A,D,F
                       8   B,G,H
                       9   B,C,F
                       10  G,H,M
                       11  A,H,N
                       12  B,C,M", header=T, stringsAsFactors=F)

# Turn each comma-separated string into a named list of item vectors
transactions2 <- sapply(split(transactions1$skills, 1:nrow(transactions1)), strsplit, split = ",")
names(transactions2) <- transactions1$unique_id

clusters2 <- sapply(split(clusters$clusters, 1:nrow(clusters)), strsplit, split = ",")
names(clusters2) <- clusters$cluster_id

# Coerce the item lists to 0/1 incidence matrices (rows = baskets, columns = items)
transactions2 <- +(as(as(transactions2, "transactions"), "matrix"))
clusters2 <- +(as(as(clusters2, "transactions"), "matrix"))

# Count the items each transaction shares with each cluster, then flag the
# clusters whose entire item set is contained in the transaction
coInc <- transactions2 %*% t(clusters2)
coInc <- t(+(t(coInc) == rowSums(clusters2)))

# Extract, per transaction, the labels (cluster_ids) of the matching clusters
res <- as(coInc, "transactions")
res <- as(res, "list")
res <- as.data.frame(sapply(res, paste, collapse = ","))

# Attach the original skills and ids, then reorder the columns
res$skills <- transactions1$skills[match(rownames(res), transactions1$unique_id)]
res$id <- rownames(res)
colnames(res) <- c("clusters", "skills", "id")
res <- res[, c(3, 2, 1)]

#     id  skills clusters
#221 221   A,B,C    1,2,3
#223 223     A,B        2
#224 224     B,C        3
#225 225 C,D,E,F      4,6
#226 226 B,C,D,M   3,5,12
#227 227 D,E,F,A      6,7
#228 228   A,D,F        7
#229 229   B,G,H        8
#230 230 B,C,F,A  1,2,3,9
#231 231   G,H,M       10
#232 232   A,H,N       11
#233 233   B,C,M     3,12
#234 234   A,B,C    1,2,3
#235 235     A,B        2
#236 236     B,C        3
#237 237   C,D,E        4
#238 238   B,C,D      3,5
#239 239   D,E,F        6
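
The same containment test can also be sketched without arules, building the 0/1 incidence matrices directly with grepl(). This is only an illustrative base-R variant of the idea above (it assumes the single-letter skill codes used in the example):

# Distinct item codes appearing in either data set
items <- sort(unique(unlist(strsplit(c(transactions1$skills, clusters$clusters), ","))))
tmat  <- sapply(items, function(k) grepl(k, transactions1$skills, fixed = TRUE))
cmat  <- sapply(items, function(k) grepl(k, clusters$clusters, fixed = TRUE))

shared <- tmat %*% t(cmat)                       # items shared per (transaction, cluster)
full   <- sweep(shared, 2, rowSums(cmat), "==")  # TRUE where the whole cluster is contained
apply(full, 1, function(x) paste(clusters$cluster_id[x], collapse = ","))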

Upvotes: 0

akash87

Reputation: 3994

I would use the apply family of functions because it is base R:

clusters11 <- data.frame(cluster_id = seq(1:12), 
                         clusters = c('A,B,C','A,B','B,C','C,D,E','B,C,D','D,E,F','A,D,F',
                                    'B,G,H','B,C,F','G,H,M','A,H,N','B,C,M'))
transactions11 <- data.frame(unique_id = c(221, seq(223,239, by = 1)), 
                             skills = c('A,B,C', 'A,B', 'B,C', 'C,D,E,F', 'B,C,D,M', 
                                        'D,E,F,A', 'A,D,F', 'B,G,H', 'B,C,F,A', 'G,H,M', 
                                        'A,H,N', 'B,C,M', 'A,B,C', 'A,B', 'B,C', 'C,D,E',
                                        'B,C,D', 'D,E,F'))

# Which full cluster strings occur verbatim inside each skills string
# (computed here but not used again below)
s <- apply(sapply(as.character(clusters11$clusters), grepl,
           as.character(transactions11$skills)), 1, which)

# Split each cluster into its individual skill codes
d.list <- sapply(as.character(clusters11$clusters), strsplit, "\\,")

# For each cluster, check row by row whether every one of its codes
# appears in the transaction's skills string
tf.tab  <- lapply(d.list, function(x) apply(sapply(x, function(x) 
                  grepl(x, transactions11$skills)), 1, all, TRUE))

# One logical column per cluster, one row per transaction
d.matrix <- do.call(cbind, tf.tab)

# Collapse the TRUE column positions (which equal cluster_id here) into a string
transactions11 <- data.frame(transactions11, 
                   matching_values = apply(d.matrix, 1, function(x) paste(which(x == TRUE), 
                                           collapse = ",")))

> transactions11
   unique_id  skills matching_values
1        221   A,B,C           1,2,3
2        223     A,B               2
3        224     B,C               3
4        225 C,D,E,F             4,6
5        226 B,C,D,M          3,5,12
6        227 D,E,F,A             6,7
7        228   A,D,F               7
8        229   B,G,H               8
9        230 B,C,F,A         1,2,3,9
10       231   G,H,M              10
11       232   A,H,N              11
12       233   B,C,M            3,12
13       234   A,B,C           1,2,3
14       235     A,B               2
15       236     B,C               3
16       237   C,D,E               4
17       238   B,C,D             3,5
18       239   D,E,F               6

This should work much faster than a loop.
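
If you want to verify that at a realistic size, here is a rough timing sketch (a hypothetical scale-up of the example skills, not part of the answer above; numbers will vary by machine):

# Repeat the example skills ~10,000 times and time the vectorised matching
big_skills <- sample(as.character(transactions11$skills), 10000, replace = TRUE)
system.time({
  tf.big <- lapply(d.list, function(x) apply(sapply(x, grepl, big_skills), 1, all))
  m.big  <- do.call(cbind, tf.big)
  out    <- apply(m.big, 1, function(x) paste(which(x), collapse = ","))
})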

Upvotes: 0

Łukasz Deryło

Reputation: 1860

I don't know how fast this would be, but let's try.

First, store the possible skill codes in codes:

codes<-LETTERS[1:14]

Then make a Boolean (TRUE/FALSE) representation of clusters11$clusters and transactns11$skills:

clusters2 <- sapply(codes, grepl, x = clusters11$clusters)
transactns2 <- sapply(codes, grepl, x = transactns11$skills)

Write a function that tests whether a given cluster is fully contained in a given transaction:

is_ok <- function(clus, tran) !any(!(tran) & clus)   # TRUE if no element of clus is missing from tran

e.g.

is_ok(clusters2[1,], transactns2[2,])
[1] FALSE

This tells you that cluster 1 does not match the second transaction.

Then write a function that checks all the clusters for a given transaction. I use the magrittr package to get access to the %>% operator.

library(magrittr)
matching_values <- function(tran) apply(clusters2, 1, is_ok, tran = tran) %>%
  which %>%
  paste(collapse = ',')

Quick check:

matching_values(transactns2[5,])
[1] "3,5,12"

Finally, apply the last function to all the transactions:

apply(transactns2,1,matching_values)
[1] "1,2,3"   "2"       "3"       "4,6"     "3,5,12"  "6,7"     "7"       "8"       "1,2,3,9"
[10] "10"      "11"      "3,12"    "1,2,3"   "2"       "3"       "4"       "3,5"     "6"   
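
To get the out_df shape the question asks for, the vector can be attached back to the original data frame. A short follow-up sketch (not part of the answer above; note that which() returns row positions in clusters2, which coincide with cluster_id here only because the ids run 1 to 12):

out_df <- data.frame(
  unique_id       = transactns11$unique_id,
  skills          = transactns11$skills,
  matching_values = apply(transactns2, 1, matching_values),
  stringsAsFactors = FALSE
)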

Upvotes: 1
