Reputation: 2443
I have a huge data frame, a sample of 3 columns and 11 rows is given below:
df <- structure(list(A = c(61960, 273, 439, 38877, 75325, 80929,
23028, 57240, 10140, 25775, 7286), B = c(10, 12, 11, 13, 2, 1, 1,
1, 1, 1, 1), C = c(122, 140, 163, 12, 190, 16, 14, 18, 15, 17, 16
)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA,
-11L))
For each column of the data frame, I would like to calculate the median number of significant digits for each order of magnitude in that column.
So for example, column A above contains 3 orders of magnitude, labelled here by digit count (10^3, 10^4, 10^5). The first number, 61960, has 4 significant digits (the trailing zero doesn't count), the second, 273, has 3, and so on.
My output should be a list for each column, with the first element a vector containing the orders of magnitude and the second a vector of the median number of significant digits for each. Since I expect one list per column, the overall output would be a list of lists. For example, for column A:
L[["A"]] = list(c(5,4,3), c(5, 4, 3))
Why is this the list? In column A there are 3 different orders of magnitude: 10^5, 10^4, 10^3. The median number of significant digits for the 10^5 o.o.m is 5, for 10^4, 4, and for 10^3, 3.
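To make the counting concrete, here is a minimal sketch that tabulates, for each value in column A, its total digit count and its digit count after trailing zeros are dropped. The names s, digits, sig and med are illustrative only, and the sketch assumes non-negative whole numbers small enough that R prints them in fixed notation.

```r
# Column A from the sample data
A <- c(61960, 273, 439, 38877, 75325, 80929, 23028, 57240, 10140, 25775, 7286)

s      <- as.character(A)            # e.g. "61960"
digits <- nchar(s)                   # total digits, used as the magnitude label
sig    <- nchar(sub("0+$", "", s))   # digits left after dropping trailing zeros

# One row per value, so each count can be checked by eye
print(data.frame(A, digits, sig))

# Median significant digits within each digit-count group
med <- tapply(sig, digits, median)
print(med)
```

For this sample the groups with 3, 4 and 5 digits have medians 3, 4 and 5, which matches the expected list for column A.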
Is there a way to do this efficiently, with something like mutate or map (not apply, because this would be the same as using a loop)?
Upvotes: 1
Views: 193
Reputation: 887048
We can do this by looping over the columns, grouping by the nchar of the column, removing the 0s at the end with sub, getting the median, and returning a list of the medians along with the grouping variable from tapply (returned as the names of the named vector)
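lapply(df, function(x) {
  # significant digits (trailing 0s removed), grouped by total digit count
  x1 <- tapply(nchar(sub("0+$", "", x)), nchar(x), FUN = median)
  # the group labels come back as the names of x1
  list(as.integer(names(x1)), as.numeric(x1))
})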
#$A
#$A[[1]]
#[1] 3 4 5
#$A[[2]]
#[1] 3 4 5
#$B
#$B[[1]]
#[1] 1 2
#$B[[2]]
#[1] 1 2
#$C
#$C[[1]]
#[1] 2 3
#$C[[2]]
#[1] 2.0 2.5
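One caveat with the base R version: nchar and sub coerce numbers to character, and R may write large round numbers in scientific notation (100000 becomes "1e+05"), which would distort both counts. A possible workaround, sketched below, converts to fixed notation first. The helper name sig_by_magnitude is hypothetical and not part of the answer above, and the sketch assumes non-negative whole numbers.

```r
# Hypothetical helper: median significant digits per digit count,
# using fixed notation so that 100000 is read as "100000", not "1e+05"
sig_by_magnitude <- function(x) {
  s  <- format(x, scientific = FALSE, trim = TRUE)   # fixed notation, no padding
  x1 <- tapply(nchar(sub("0+$", "", s)), nchar(s), FUN = median)
  list(as.integer(names(x1)), as.numeric(x1))
}

# Small illustrative input (not from the question)
res <- sig_by_magnitude(c(100000, 123456, 2500, 42))
print(res)
```

It can be applied per column in the same way, for example lapply(df, sig_by_magnitude).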
Or this can also be done with tidyverse, returning the result as a single dataset
library(tidyverse)
df %>%
  mutate_all(str_remove, "0+$") %>%
  map2_dfr(., df, ~
    tibble(x = nchar(.x), grp = nchar(.y)) %>%
      group_by(grp) %>%
      summarise(x = median(x)), .id = 'colName')
# A tibble: 7 x 3
#  colName   grp     x
#  <chr>   <int> <dbl>
#1 A           3   3
#2 A           4   4
#3 A           5   5
#4 B           1   1
#5 B           2   2
#6 C           2   2
#7 C           3   2.5
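If the list-of-lists shape from the question is still wanted, the summary table can be split back by column name. The following is only a sketch: res is rebuilt by hand from the printed output above so that the snippet runs on its own, and the names res and L are illustrative.

```r
# Summary table copied from the printed tibble above
res <- data.frame(
  colName = c("A", "A", "A", "B", "B", "C", "C"),
  grp     = c(3L, 4L, 5L, 1L, 2L, 2L, 3L),
  x       = c(3, 4, 5, 1, 2, 2, 2.5)
)

# One list per column: first the group labels, then the medians
L <- lapply(split(res, res$colName), function(d) list(d$grp, d$x))
print(L[["C"]])
```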
Upvotes: 1