Markus Kindler

Reputation: 33

Count occurrences of strings across multiple columns efficiently

I have a large data frame (>4 million rows) with columns yname1, yname2, yname3 that store strings:

yname1 | yname2 | yname3
aaaaaa | bbbaaa | bbaaaa
aaabbb | cccccc | aaaaaa
aaaaaa | aaabbb | dddddd
cccccc | dddddd | eeeeee

Now I want to calculate the total number of occurrences of each string across all columns. These should be added as additional columns:

yname1 | yname2 | yname3 | rcount1 | rcount2 | rcount3
aaaaaa | bbbaaa | bbaaaa |       3 |       1 |       1
aaabbb | cccccc | aaaaaa |       2 |       2 |       3
aaaaaa | aaabbb | dddddd |       3 |       2 |       2
cccccc | dddddd | eeeeee |       2 |       2 |       1

I have already written the following code, which does the job:

data3$rcount1 <- sapply(data3$yname1, function(x) sum(data3$yname1==x)+sum(data3$yname2==x)+sum(data3$yname3==x))
data3$rcount2 <- sapply(data3$yname2, function(x) sum(data3$yname1==x)+sum(data3$yname2==x)+sum(data3$yname3==x))
data3$rcount3 <- sapply(data3$yname3, function(x) sum(data3$yname1==x)+sum(data3$yname2==x)+sum(data3$yname3==x))

However, this is really slow and would take days to calculate. Any ideas how I could speed this up?

Upvotes: 3

Views: 3974

Answers (5)

Jota

Reputation: 17621

How about a data.table approach:

library(data.table)
setDT(d)

lookup <- melt(d, measure.vars = paste0("yname", 1:3))[, .N, by = value]
#    value N
#1: aaaaaa 3
#2: aaabbb 2
#3: cccccc 2
#4: bbbaaa 1
#5: dddddd 2
#6: bbaaaa 1
#7: eeeeee 1

d[, paste0("rcount", 1:3) :=
   lapply(d, function(x) lookup[x, , on = .(value)][, N])]

#   yname1 yname2 yname3 rcount1 rcount2 rcount3
#1: aaaaaa bbbaaa bbaaaa       3       1       1
#2: aaabbb cccccc aaaaaa       2       2       3
#3: aaaaaa aaabbb dddddd       3       2       2
#4: cccccc dddddd eeeeee       2       2       1

Microbenchmark results, using bgoldst's benchmarking setup but with 400,000 rows:

Unit: seconds
          expr       min        lq      mean    median        uq       max neval
   bgoldst(df) 21.445961 21.628228 21.876051 21.810496 22.091096 22.371697     3  
 alistaire(df) 20.685357 20.961761 21.255457 21.238164 21.540507 21.842850     3  
      jota(dt)  2.629337  2.692613  2.719207  2.755889  2.764141  2.772394     3   
    mhairi(df) 40.780441 41.048345 41.669798 41.316249 42.114476 42.912702     3
   coffein(df) 35.669630 35.678719 36.453257 35.687808 36.845071 38.002334     3 
  espresso(df) 20.823840 20.976175 21.317218 21.128509 21.563907 21.999306     3

Upvotes: 6

bgoldst

Reputation: 35324

There are some good solutions already, but none use match() to look up each string in a precomputed frequency table. Below is how this can be done. Note that I opted for as.matrix() to produce a matrix of the yname* columns for both the argument to table() and the first argument to match().

cns <- grep(value=T,'^yname',names(df));
m <- as.matrix(df[cns]);
cnts <- table(m);
df[,paste0('rcount',seq_along(cns))] <- matrix(cnts[match(m,names(cnts))],nrow(df));
df;
##   yname1 yname2 yname3 rcount1 rcount2 rcount3
## 1 aaaaaa bbbaaa bbaaaa       3       1       1
## 2 aaabbb cccccc aaaaaa       2       2       3
## 3 aaaaaa aaabbb dddddd       3       2       2
## 4 cccccc dddddd eeeeee       2       2       1

Update: I can't believe I missed this previously, but the expression

cnts[match(m,names(cnts))]

can be replaced with

cnts[m]

So a call to match() is not necessary at all.

I just reran the benchmarks and found that it doesn't change the run-time of my solution in any significant way (perhaps just a slight speedup in the small-scale test). Presumably this is because indexing a named vector by character strings requires the same kind of match() logic internally, so no performance is gained by the above replacement. But I would say the improvement in concision and simplicity is worth it.
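
For reference, here is a sketch of the full solution with that simplification applied (identical to the code above, just with the match() call dropped):

cns <- grep(value=T,'^yname',names(df));
m <- as.matrix(df[cns]);
cnts <- table(m);
df[,paste0('rcount',seq_along(cns))] <- matrix(cnts[m],nrow(df));
df;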


Benchmarking

I should note that I made some small modifications to some of the other solutions in order to produce these benchmarking results. Most notably, I wanted to avoid having to copy any of the inputs for the repetitive executions, but since data.tables get passed by reference, I had to modify jota() to make it idempotent. This involved filtering for just the target yname* columns, whose names I precompute into a local variable called cns via a grep() call, just as I do in my own solution.

To try to be fair, I added the same grep() call and filtering logic to all solutions, except for markus(), which doesn't require it since it explicitly processes each column separately. I also changed the index-join operation on lookup in jota() to lookup[.(value=x),,on='value'], since it wasn't working for me otherwise. Finally, for mhairi(), I completed the solution by adding a Reduce() call to merge in all yname* columns.

library(microbenchmark);
library(data.table);
library(dplyr);

bgoldst <- function(df) { cns <- grep(value=T,'^yname',names(df)); m <- as.matrix(df[cns]); cnts <- table(m); df[,paste0('rcount',seq_along(cns))] <- matrix(cnts[match(m,names(cnts))],nrow(df)); df; };
markus <- function(df) { df$rcount1 <- sapply(df$yname1, function(x) sum(df$yname1==x)+sum(df$yname2==x)+sum(df$yname3==x)); df$rcount2 <- sapply(df$yname2, function(x) sum(df$yname1==x)+sum(df$yname2==x)+sum(df$yname3==x)); df$rcount3 <- sapply(df$yname3, function(x) sum(df$yname1==x)+sum(df$yname2==x)+sum(df$yname3==x)); df; };
alistaire <- function(df) { cns <- grep(value=T,'^yname',names(df)); df_table <- table(unlist(df[cns])); data.frame(df[cns],lapply(df[cns],function(x){data.frame(Freq=df_table[as.character(x)])})); };
jota <- function(dt) { cns <- grep(value=T,'^yname',names(df)); lookup <- melt(dt, measure.vars = cns)[, .N, by = value]; dt[, paste0("rcount", 1:3) := lapply(dt[,cns,with=F], function(x) lookup[.(value=x), , on = 'value'][, N])]; };
mhairi <- function(df) { cns <- grep(value=T,'^yname',names(df)); all_yname <-do.call(c,df[cns]); rcount <- as.data.frame(table(all_yname)); Reduce(function(df,cn) merge(df, rcount, by.x = cn, by.y = 'all_yname'),cns,df); };
coffein <- function(df) { cns <- grep(value=T,'^yname',names(df)); df2 <- melt(df[cns], id.vars = NULL); df2 <- df2 %>% group_by(value) %>% summarise(n=n()) %>% as.data.frame(); rownames(df2) <- df2$value; df2$value <- NULL; df$r1 <- df2[df$yname1,]; df$r2 <- df2[df$yname2,]; df$r3 <- df2[df$yname3,]; df; };

## OP's test case
df <- data.frame(yname1=c('aaaaaa','aaabbb','aaaaaa','cccccc'),yname2=c('bbbaaa','cccccc','aaabbb','dddddd'),yname3=c('bbaaaa','aaaaaa','dddddd','eeeeee'),stringsAsFactors=F);
dt <- as.data.table(df);

cns <- grep(value=T,'^yname',names(df));
ex <- bgoldst(df);
identical(ex,markus(df));
## [1] TRUE
identical(ex,{ y <- alistaire(df); names(y) <- names(ex); rownames(y) <- NULL; cis <- seq_along(df)+ncol(df); y[cis] <- lapply(y[cis],as.integer); y; });
## [1] TRUE
identical(ex,as.data.frame(jota(dt)));
## [1] TRUE
identical(ex,{ y <- mhairi(df); y <- y[c(cns,names(y)[!names(y)%in%cns])]; names(y) <- names(ex); y <- y[do.call(order,Map(match,ex,y)),]; rownames(y) <- NULL; y; });
## [1] TRUE
identical(ex,{ y <- coffein(df); names(y) <- names(ex); y; });
## [1] TRUE

microbenchmark(bgoldst(df),markus(df),alistaire(df),jota(dt),mhairi(df),coffein(df));
## Unit: microseconds
##           expr      min        lq      mean    median       uq      max neval
##    bgoldst(df)  491.373  544.6165  597.4743  575.8350  609.192 2054.872   100
##     markus(df)  375.907  435.5645  463.7258  467.4250  489.022  549.962   100
##  alistaire(df)  754.380  816.1755  849.8749  840.3385  888.021  959.654   100
##       jota(dt) 4143.955 4425.7785 4741.8354 4656.2835 4854.928 7347.930   100
##     mhairi(df) 1938.122 2047.1740 2182.1841 2135.4850 2209.896 3969.045   100
##    coffein(df) 1286.380 1430.9265 1546.3245 1511.3255 1562.430 3319.441   100

## scale test
set.seed(1L);
NR <- 4e3L; NC <- 3L; SL <- 6L;
df <- as.data.frame(setNames(nm=paste0('yname',seq_len(NC)),replicate(NC,do.call(paste0,replicate(SL,sample(letters,NR,T),simplify=F)),simplify=F)),stringsAsFactors=F);
dt <- as.data.table(df);

ex <- bgoldst(df);
identical(ex,markus(df));
## [1] TRUE
identical(ex,{ y <- alistaire(df); names(y) <- names(ex); rownames(y) <- NULL; cis <- seq_along(df)+ncol(df); y[cis] <- lapply(y[cis],as.integer); y; });
## [1] TRUE
identical(ex,as.data.frame(jota(dt)));
## [1] TRUE
identical(ex,{ y <- mhairi(df); y <- y[c(cns,names(y)[!names(y)%in%cns])]; names(y) <- names(ex); y <- y[do.call(order,Map(match,y,ex)),]; rownames(y) <- NULL; y; });
## [1] TRUE
identical(ex,{ y <- coffein(df); names(y) <- names(ex); y; });
## [1] TRUE

microbenchmark(bgoldst(df),markus(df),alistaire(df),jota(dt),mhairi(df),coffein(df),times=3L);
## Unit: milliseconds
##           expr        min         lq       mean     median         uq        max neval
##    bgoldst(df)   85.20766   87.00487   88.39154   88.80209   89.98348   91.16487     3
##     markus(df) 3771.08606 3788.97413 3799.08405 3806.86220 3813.08305 3819.30390     3
##  alistaire(df)   83.03348   83.10276   83.18116   83.17204   83.25500   83.33797     3
##       jota(dt)   12.49174   13.82088   14.44939   15.15002   15.42821   15.70640     3
##     mhairi(df)  156.06459  156.36608  158.27256  156.66758  159.37654  162.08551     3
##    coffein(df)  154.02853  154.97215  156.52246  155.91576  157.76942  159.62309     3

Upvotes: 5

coffeinjunky

Reputation: 11514

I like the above answers more, but just for completeness, let me add one alternative, which is based on using the unique strings as rownames:

library(reshape2)
library(dplyr)

df2 <- melt(df, id.vars = NULL)
df2 <- df2 %>% group_by(value) %>% summarise(n=n()) %>% as.data.frame()
rownames(df2) <- df2$value
df2$value <- NULL

Now we have a data frame containing the number of occurrences of each unique string, with the strings themselves as row names. We can use those row names to subset this data frame.

# df[] <- lapply(df, as.character) # in case they are stored as factors
df$r1 <- df2[df$yname1,]
df$r2 <- df2[df$yname2,]
df$r3 <- df2[df$yname3,]

> df
  yname1 yname2 yname3 r1 r2 r3
1 aaaaaa bbbaaa bbaaaa  3  1  1
2 aaabbb cccccc aaaaaa  2  2  3
3 aaaaaa aaabbb dddddd  3  2  2
4 cccccc dddddd eeeeee  2  2  1

Edit:

Looking at the other answers, and given that we started talking about performance, I realized the above is unnecessarily complicated and can be improved as follows:

df2 <- data.frame(table(unlist(df)), row.names = 1)
df$r1 <- df2[df$yname1,]
df$r2 <- df2[df$yname2,]
df$r3 <- df2[df$yname3,]

This avoids invoking reshape2 and dplyr entirely, and improves performance accordingly. Using

espresso <- function(df) { 
  cns <- grep(value=T,'^yname',names(df)); 
  df2 <- data.frame(table(unlist(df[cns])), row.names = 1)
  df$r1 <- df2[df$yname1,]; df$r2 <- df2[df$yname2,]; df$r3 <- df2[df$yname3,]; df
};

this solution is now much faster, though not as fast as some of the alternatives. See

microbenchmark(bgoldst(df),markus(df),alistaire(df),jota(dt),mhairi(df),coffein(df), espresso(df), times=1000);
Unit: microseconds
          expr      min       lq      mean    median        uq       max neval
   bgoldst(df)  579.447  673.956  739.9614  713.2980  759.0550  3719.153  1000
    markus(df)  549.514  630.123  681.1892  655.1390  679.0870  3767.048  1000
 alistaire(df) 1662.650 1796.287 1957.4346 1851.8795 1921.5840 26532.692  1000
      jota(dt) 5551.147 5897.745 6333.6954 6041.8590 6283.6880 22457.746  1000
    mhairi(df) 2538.450 2717.843 2990.8535 2793.1070 2910.9205 65752.067  1000
   coffein(df) 1636.565 1858.936 2006.7821 1941.2555 2016.7330  4553.044  1000
  espresso(df)  753.496  825.766  910.6520  865.5365  925.4055  4662.091  1000

Upvotes: 3

alistaire

Reputation: 43364

In base R, you can build a table of the unlisted values of the data.frame and then index it by each column's values. Make sure what you index by is a character vector, not a factor (hence the as.character), or it will be indexed by position instead of name.

data.frame(df, 
           lapply(df, function(x){data.frame(table(unlist(df))[as.character(x)])['Freq']})
)
#   yname1 yname2 yname3 Freq Freq.1 Freq.2
# 1 aaaaaa bbbaaa bbaaaa    3      1      1
# 2 aaabbb cccccc aaaaaa    2      2      3
# 3 aaaaaa aaabbb dddddd    3      2      2
# 4 cccccc dddddd eeeeee    2      2      1
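
To see why the as.character() matters, here is a minimal illustration with made-up values (not part of the data above); indexing a named table by a factor uses the factor's integer codes rather than its labels:

tab <- table(c("a", "b", "b"))
f <- factor("b", levels = c("b", "a"))
tab[f]                 # indexed by the factor's code (1), so this returns the count for "a"
tab[as.character(f)]   # indexed by the name "b", so this returns the count for "b"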

If the data.frame is large enough that this is slow, you can build the table outside of the lapply so it only runs once:

df_table <- table(unlist(df))
data.frame(df, lapply(df, function(x){data.frame(df_table[as.character(x)])['Freq']}))

You can also put it in dplyr, which makes it more readable:

       # look up times repeated
df %>% mutate_each(funs(table(unlist(df))[as.character(.)])) %>%   # or mutate_each(funs(df_table[as.character(.)]))
    # fix column names
    select(rcount = starts_with('yname')) %>% 
    # add original df back in
    bind_cols(df, .)
# Source: local data frame [4 x 6]
# 
#   yname1 yname2 yname3 rcount1 rcount2 rcount3
#   (fctr) (fctr) (fctr)  (tabl)  (tabl)  (tabl)
# 1 aaaaaa bbbaaa bbaaaa       3       1       1
# 2 aaabbb cccccc aaaaaa       2       2       3
# 3 aaaaaa aaabbb dddddd       3       2       2
# 4 cccccc dddddd eeeeee       2       2       1

Data

df <- structure(list(yname1 = c("aaaaaa", "aaabbb", "aaaaaa", "cccccc"
    ), yname2 = c("bbbaaa", "cccccc", "aaabbb", "dddddd"), yname3 = c("bbaaaa", 
    "aaaaaa", "dddddd", "eeeeee")), .Names = c("yname1", "yname2", 
    "yname3"), row.names = c(NA, -4L), class = "data.frame")

Upvotes: 6

Mhairi McNeill

Reputation: 1991

I think it would be faster to find the count for each unique value and then join onto the original table.

all_yname <-c(df$yname1, df$yname2, df$yname3)
rcount <- as.data.frame(table(all_yname))

merge(df, rcount, by.x = 'yname1', by.y = 'all_yname')

Then repeat the merge for each of the other yname columns, for example as sketched below.
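
A minimal sketch of that repetition, assuming the Reduce() approach bgoldst used to complete this solution in his benchmark above; the renaming of the Freq columns at the end is just hypothetical clean-up to get the rcount* names:

all_yname <- c(df$yname1, df$yname2, df$yname3)
rcount <- as.data.frame(table(all_yname))

# merge the per-string counts in once for each yname column
out <- Reduce(function(d, cn) merge(d, rcount, by.x = cn, by.y = 'all_yname'),
              paste0('yname', 1:3), df)

# the three merges leave count columns named Freq.x, Freq.y and Freq;
# rename them to the desired rcount* names
names(out)[grep('^Freq', names(out))] <- paste0('rcount', 1:3)

Keep in mind that merge() sorts the result by the join column and moves that column to the front, so the rows and columns of out will not be in the original order.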

Upvotes: 2
