Reputation: 33
I have a large data frame (>4 million rows) with columns yname1
, yname2
, yname3
that store strings:
yname1 | yname2 | yname3
aaaaaa | bbbaaa | bbaaaa
aaabbb | cccccc | aaaaaa
aaaaaa | aaabbb | dddddd
cccccc | dddddd | eeeeee
Now I want to calculate the total number of occurrences of each string across all columns. These should be added as additional columns:
yname1 | yname2 | yname3 | rcount1 | rcount2 | rcount3
aaaaaa | bbbaaa | bbaaaa | 3 | 1 | 1
aaabbb | cccccc | aaaaaa | 2 | 2 | 3
aaaaaa | aaabbb | dddddd | 3 | 2 | 2
cccccc | dddddd | eeeeee | 2 | 2 | 1
I have already written the following code, which does the job:
data3$rcount1 <- sapply(data3$yname1, function(x) sum(data2$yname1==x)+sum(data2$yname2==x)+sum(data2$yname3==x))
data3$rcount2 <- sapply(data3$yname2, function(x) sum(data2$yname1==x)+sum(data2$yname2==x)+sum(data2$yname3==x))
data3$rcount3 <- sapply(data3$yname3, function(x) sum(data2$yname1==x)+sum(data2$yname2==x)+sum(data2$yname3==x))
However, this is really slow and would take days to calculate. Any ideas how I could speed this up?
Upvotes: 3
Views: 3974
Reputation: 17621
How about a data.table
approach:
library(data.table)
setDT(d)
lookup <- melt(d, measure.vars = paste0("yname", 1:3))[, .N, by = value]
# value N
#1: aaaaaa 3
#2: aaabbb 2
#3: cccccc 2
#4: bbbaaa 1
#5: dddddd 2
#6: bbaaaa 1
#7: eeeeee 1
d[, paste0("rcount", 1:3) :=
lapply(d, function(x) lookup[x, , on = .(value)][, N])]
# yname1 yname2 yname3 rcount1 rcount2 rcount3
#1: aaaaaa bbbaaa bbaaaa 3 1 1
#2: aaabbb cccccc aaaaaa 2 2 3
#3: aaaaaa aaabbb dddddd 3 2 2
#4: cccccc dddddd eeeeee 2 2 1
Microbenchmark output copying from bgoldst's example, but with 400,000 rows.
Unit: seconds
expr min lq mean median uq max neval
bgoldst(df) 21.445961 21.628228 21.876051 21.810496 22.091096 22.371697 3
alistaire(df) 20.685357 20.961761 21.255457 21.238164 21.540507 21.842850 3
jota(dt) 2.629337 2.692613 2.719207 2.755889 2.764141 2.772394 3
mhairi(df) 40.780441 41.048345 41.669798 41.316249 42.114476 42.912702 3
coffein(df) 35.669630 35.678719 36.453257 35.687808 36.845071 38.002334 3
espresso(df) 20.823840 20.976175 21.317218 21.128509 21.563907 21.999306 3
Upvotes: 6
Reputation: 35324
There are some good solutions already, but none use match()
to look up each string in a precomputed frequency table. Below is how this can be done. Note that I opted for as.matrix()
to produce a matrix of the yname*
columns for both the argument to table()
and the first argument to match()
.
cns <- grep(value=T,'^yname',names(df));
m <- as.matrix(df[cns]);
cnts <- table(m);
df[,paste0('rcount',seq_along(cns))] <- matrix(cnts[match(m,names(cnts))],nrow(df));
df;
## yname1 yname2 yname3 rcount1 rcount2 rcount3
## 1 aaaaaa bbbaaa bbaaaa 3 1 1
## 2 aaabbb cccccc aaaaaa 2 2 3
## 3 aaaaaa aaabbb dddddd 3 2 2
## 4 cccccc dddddd eeeeee 2 2 1
Update: I can't believe I missed this previously, but the expression
cnts[match(m,names(cnts))]
can be replaced with
cnts[m]
So a call to match()
is not necessary at all.
I just reran the benchmarks and found that it doesn't change the run-time of my solution in any significant way (perhaps just a slight speedup in the small-scale test). Presumably this is because indexing a vector with character names requires the same kind of match()
logic internally, so no performance is gained by the above replacement. But I would say the improvement in concision and simplicity is worth it.
I should note that I made some small modifications to some of the other solutions in order to produce these benchmarking results. Most notably, I wanted to avoid having to copy any of the inputs for the repetitive executions, but since data.tables get passed by reference, I had to modify jota()
to make it idempotent. This involved filtering for just the target yname*
columns, which I precompute into a local variable called cns
via a grep()
call, just as I do in my own solution. To try to be fair, I added the same grep()
call and filtering logic to all solutions, except for markus()
, which doesn't require it since it explicitly processes each column separately. I also changed the index-join operation on lookup
in jota()
to lookup[.(value=x),,on='value']
, since it wasn't working for me otherwise. Finally, for mhairi()
, I completed the solution by adding a Reduce()
call to merge in all yname*
columns.
library(microbenchmark);
library(data.table);
library(dplyr);
bgoldst <- function(df) { cns <- grep(value=T,'^yname',names(df)); m <- as.matrix(df[cns]); cnts <- table(m); df[,paste0('rcount',seq_along(cns))] <- matrix(cnts[match(m,names(cnts))],nrow(df)); df; };
markus <- function(df) { df$rcount1 <- sapply(df$yname1, function(x) sum(df$yname1==x)+sum(df$yname2==x)+sum(df$yname3==x)); df$rcount2 <- sapply(df$yname2, function(x) sum(df$yname1==x)+sum(df$yname2==x)+sum(df$yname3==x)); df$rcount3 <- sapply(df$yname3, function(x) sum(df$yname1==x)+sum(df$yname2==x)+sum(df$yname3==x)); df; };
alistaire <- function(df) { cns <- grep(value=T,'^yname',names(df)); df_table <- table(unlist(df[cns])); data.frame(df[cns],lapply(df[cns],function(x){data.frame(Freq=df_table[as.character(x)])})); };
jota <- function(dt) { cns <- grep(value=T,'^yname',names(df)); lookup <- melt(dt, measure.vars = cns)[, .N, by = value]; dt[, paste0("rcount", 1:3) := lapply(dt[,cns,with=F], function(x) lookup[.(value=x), , on = 'value'][, N])]; };
mhairi <- function(df) { cns <- grep(value=T,'^yname',names(df)); all_yname <-do.call(c,df[cns]); rcount <- as.data.frame(table(all_yname)); Reduce(function(df,cn) merge(df, rcount, by.x = cn, by.y = 'all_yname'),cns,df); };
coffein <- function(df) { cns <- grep(value=T,'^yname',names(df)); df2 <- melt(df[cns], id.vars = NULL); df2 <- df2 %>% group_by(value) %>% summarise(n=n()) %>% as.data.frame(); rownames(df2) <- df2$value; df2$value <- NULL; df$r1 <- df2[df$yname1,]; df$r2 <- df2[df$yname2,]; df$r3 <- df2[df$yname3,]; df; };
## OP's test case
df <- data.frame(yname1=c('aaaaaa','aaabbb','aaaaaa','cccccc'),yname2=c('bbbaaa','cccccc','aaabbb','dddddd'),yname3=c('bbaaaa','aaaaaa','dddddd','eeeeee'),stringsAsFactors=F);
dt <- as.data.table(df);
ex <- bgoldst(df);
identical(ex,markus(df));
## [1] TRUE
identical(ex,{ y <- alistaire(df); names(y) <- names(ex); rownames(y) <- NULL; cis <- seq_along(df)+ncol(df); y[cis] <- lapply(y[cis],as.integer); y; });
## [1] TRUE
identical(ex,as.data.frame(jota(dt)));
## [1] TRUE
identical(ex,{ y <- mhairi(df); y <- y[c(cns,names(y)[!names(y)%in%cns])]; names(y) <- names(ex); y <- y[do.call(order,Map(match,ex,y)),]; rownames(y) <- NULL; y; });
## [1] TRUE
identical(ex,{ y <- coffein(df); names(y) <- names(ex); y; });
## [1] TRUE
microbenchmark(bgoldst(df),markus(df),alistaire(df),jota(dt),mhairi(df),coffein(df));
## Unit: microseconds
## expr min lq mean median uq max neval
## bgoldst(df) 491.373 544.6165 597.4743 575.8350 609.192 2054.872 100
## markus(df) 375.907 435.5645 463.7258 467.4250 489.022 549.962 100
## alistaire(df) 754.380 816.1755 849.8749 840.3385 888.021 959.654 100
## jota(dt) 4143.955 4425.7785 4741.8354 4656.2835 4854.928 7347.930 100
## mhairi(df) 1938.122 2047.1740 2182.1841 2135.4850 2209.896 3969.045 100
## coffein(df) 1286.380 1430.9265 1546.3245 1511.3255 1562.430 3319.441 100
## scale test
set.seed(1L);
NR <- 4e3L; NC <- 3L; SL <- 6L;
df <- as.data.frame(setNames(nm=paste0('yname',seq_len(NC)),replicate(NC,do.call(paste0,replicate(SL,sample(letters,NR,T),simplify=F)),simplify=F)),stringsAsFactors=F);
dt <- as.data.table(df);
ex <- bgoldst(df);
identical(ex,markus(df));
## [1] TRUE
identical(ex,{ y <- alistaire(df); names(y) <- names(ex); rownames(y) <- NULL; cis <- seq_along(df)+ncol(df); y[cis] <- lapply(y[cis],as.integer); y; });
## [1] TRUE
identical(ex,as.data.frame(jota(dt)));
## [1] TRUE
identical(ex,{ y <- mhairi(df); y <- y[c(cns,names(y)[!names(y)%in%cns])]; names(y) <- names(ex); y <- y[do.call(order,Map(match,y,ex)),]; rownames(y) <- NULL; y; });
## [1] TRUE
identical(ex,{ y <- coffein(df); names(y) <- names(ex); y; });
## [1] TRUE
microbenchmark(bgoldst(df),markus(df),alistaire(df),jota(dt),mhairi(df),coffein(df),times=3L);
## Unit: milliseconds
## expr min lq mean median uq max neval
## bgoldst(df) 85.20766 87.00487 88.39154 88.80209 89.98348 91.16487 3
## markus(df) 3771.08606 3788.97413 3799.08405 3806.86220 3813.08305 3819.30390 3
## alistaire(df) 83.03348 83.10276 83.18116 83.17204 83.25500 83.33797 3
## jota(dt) 12.49174 13.82088 14.44939 15.15002 15.42821 15.70640 3
## mhairi(df) 156.06459 156.36608 158.27256 156.66758 159.37654 162.08551 3
## coffein(df) 154.02853 154.97215 156.52246 155.91576 157.76942 159.62309 3
Upvotes: 5
Reputation: 11514
I like the above answers more, but just for completeness, let me add one alternative, which is based on using the unique strings as rownames:
df2 <- melt(df, id.vars = NULL)
df2 <- df2 %>% group_by(value) %>% summarise(n=n()) %>% as.data.frame()
rownames(df2) <- df2$value
df2$value <- NULL
Now we have a dataframe containing the number of occurences of the unique character vectors, and the character vectors are rownames. These we can use to subset said dataframe.
# df[] <- lapply(df, as.character) # in case they are stored as factors
df$r1 <- df2[df$yname1,]
df$r2 <- df2[df$yname2,]
df$r3 <- df2[df$yname3,]
> df
yname1 yname2 yname3 r1 r2 r3
1 aaaaaa bbbaaa bbaaaa 3 1 1
2 aaabbb cccccc aaaaaa 2 2 3
3 aaaaaa aaabbb dddddd 3 2 2
4 cccccc dddddd eeeeee 2 2 1
Edit:
Looking at the other answers, and given that we started talking about performance, I realized the above is unneccessarily complicated and can be improved as follows:
df2 <- data.frame(table(unlist(df)), row.names = 1)
df$r1 <- df2[df$yname1,]
df$r2 <- df2[df$yname2,]
df$r3 <- df2[df$yname3,]
This avoids the invoking reshape2
and dplyr
completely, and improves the performance accordingly. Using
espresso <- function(df) {
cns <- grep(value=T,'^yname',names(df));
df2 <- data.frame(table(unlist(df[cns])), row.names = 1)
df$r1 <- df2[df$yname1,]; df$r2 <- df2[df$yname2,]; df$r3 <- df2[df$yname3,]; df
};
this solution is now much faster, though not as fast as some of the alternatives. See
microbenchmark(bgoldst(df),markus(df),alistaire(df),jota(dt),mhairi(df),coffein(df), espresso(df), times=1000);
Unit: microseconds
expr min lq mean median uq max neval
bgoldst(df) 579.447 673.956 739.9614 713.2980 759.0550 3719.153 1000
markus(df) 549.514 630.123 681.1892 655.1390 679.0870 3767.048 1000
alistaire(df) 1662.650 1796.287 1957.4346 1851.8795 1921.5840 26532.692 1000
jota(dt) 5551.147 5897.745 6333.6954 6041.8590 6283.6880 22457.746 1000
mhairi(df) 2538.450 2717.843 2990.8535 2793.1070 2910.9205 65752.067 1000
coffein(df) 1636.565 1858.936 2006.7821 1941.2555 2016.7330 4553.044 1000
espresso(df) 753.496 825.766 910.6520 865.5365 925.4055 4662.091 1000
Upvotes: 3
Reputation: 43364
In base R, you can build a table of the unlisted values of the data.frame, and index them by the values. Make sure what you index by is a string, not a factor (thus the as.character
), or it will be indexed by number instead of name.
data.frame(df,
lapply(df, function(x){data.frame(table(unlist(df))[as.character(x)])['Freq']})
)
# yname1 yname2 yname3 Freq Freq.1 Freq.2
# 1 aaaaaa bbbaaa bbaaaa 3 1 1
# 2 aaabbb cccccc aaaaaa 2 2 3
# 3 aaaaaa aaabbb dddddd 3 2 2
# 4 cccccc dddddd eeeeee 2 2 1
If the data.frame is large enough that this is slow, you can build the table outside of the lapply
so it only runs once:
df_table <- table(unlist(df))
data.frame(df, lapply(df, function(x){data.frame(df_table[as.character(x)])['Freq']}))
You can also put it in dplyr
, which makes it more readable:
# look up times repeated
df %>% mutate_each(funs(table(unlist(df))[as.character(.)])) %>% # or mutate_each(funs(df_table[as.character(.)]))
# fix column names
select(rcount = starts_with('yname')) %>%
# add original df back in
bind_cols(df, .)
# Source: local data frame [4 x 6]
#
# yname1 yname2 yname3 rcount1 rcount2 rcount3
# (fctr) (fctr) (fctr) (tabl) (tabl) (tabl)
# 1 aaaaaa bbbaaa bbaaaa 3 1 1
# 2 aaabbb cccccc aaaaaa 2 2 3
# 3 aaaaaa aaabbb dddddd 3 2 2
# 4 cccccc dddddd eeeeee 2 2 1
df <- structure(list(yname1 = c("aaaaaa", "aaabbb", "aaaaaa", "cccccc"
), yname2 = c("bbbaaa", "cccccc", "aaabbb", "dddddd"), yname3 = c("bbaaaa",
"aaaaaa", "dddddd", "eeeeee")), .Names = c("yname1", "yname2",
"yname3"), row.names = c(NA, -4L), class = "data.frame")
Upvotes: 6
Reputation: 1991
I think it would be faster to find the sum for each unique value and then join onto the original table.
all_yname <-c(df$yname1, df$yname2, df$yname3)
rcount <- as.data.frame(table(all_yname))
merge(df, rcount, by.x = 'yname1', by.y = 'all_yname')
And repeat the merge for each row.
Upvotes: 2