Reputation:
probably a pretty basic question. I'm trying to analyse a data set that includes a column with footnote references in square brackets. I hope to remove these but I cannot seem to find the appropriate function/pattern to remove both them and the numbers inside the brackets.
Likewise, I would like take the median of those entries which have two numbers, e.g. "5930 – 6970[13]" and thereafter divide all entries by 1000.
"1800[10]" "3300 – 12 000[11]" "5000 – 6000[12]"
"5930 – 6970[13]" "75 000 – 80 000[14]" "93 000[15]"
"195 000 – 210 000[16]" "530 000[17]" "595 000[18]"
Any help will be highly appreciated.
Upvotes: 0
Views: 357
Reputation: 6628
A dplyr
/tidyverse
solution: Similar to the answers by akrun and 42-,
I use gsub()
to get rid of the brackets, their contents and spaces.
Next I tidyr::separate()
the values in col1 into two columns.
dplyr::rowise()
and dplyr::mutate()
are then used to calculate the
median per row and devide by 1000.
library(dplyr)
library(tidyr)
df1 %>%
mutate(col1 = gsub("\\[(.*?)\\]|\\s+", "", col1)) %>%
separate(col1, into = c("col1", "col2"), sep = "–", fill = "right") %>%
mutate_at(vars(col1, col2), as.numeric) %>%
rowwise() %>%
mutate(result = median(c(col1, col2) / 1000, na.rm = TRUE))
#> Source: local data frame [9 x 3]
#> Groups: <by row>
#>
#> # A tibble: 9 x 3
#> col1 col2 result
#> <dbl> <dbl> <dbl>
#> 1 1800 NA 1.8
#> 2 3300 12000 7.65
#> 3 5000 6000 5.5
#> 4 5930 6970 6.45
#> 5 75000 80000 77.5
#> 6 93000 NA 93
#> 7 195000 210000 202.
#> 8 530000 NA 530
#> 9 595000 NA 595
Alternatively everything can be done with a single mutate()
.
library(purrr)
df1 %>%
rowwise() %>%
mutate(col1 = gsub("\\[(.*?)\\]|\\s+", "", col1) %>%
strsplit("–") %>%
map_dbl(function(x) as.numeric(x) %>%
median())/1000)
#> Source: local data frame [9 x 1]
#> Groups: <by row>
#>
#> # A tibble: 9 x 1
#> col1
#> <dbl>
#> 1 1.8
#> 2 7.65
#> 3 5.5
#> 4 6.45
#> 5 77.5
#> 6 93
#> 7 202.
#> 8 530
#> 9 595
I feel my first example is more explicit, while this one is a bit more
obstructed and I usually try to avoid pipes within a mutate()
.
Upvotes: 0
Reputation: 263382
I interpreted the request a bit differently than @akrun. Thinking that the dashes separated numbers whose thousands-separator was a space I removed both spaces and trailing bracketed numbers using gsub
instead of sub
. Then the sapply
procedure will be given at most two numbers in more comparable ranges:
df1 <- data.frame(col1 = scan(text=' "1800[10]" "3300 – 12 000[11]" "5000 – 6000[12]"
"5930 – 6970[13]" "75 000 – 80 000[14]" "93 000[15]"
"195 000 – 210 000[16]" "530 000[17]" "595 000[18]" ', what=""))
Processing:
sapply(
sapply( strsplit(gsub("[ ]|\\[.*$", "", df1$col1), "–"), # remove spc's and [n]
as.numeric),
median)/1000
[1] 1.80 7.65 5.50 6.45 77.50 93.00 202.50 530.00 595.00
Upvotes: 2
Reputation: 887251
We could remove the substring that start with [
using sub
, then split the stirng at either the space or space followed by -
and any space into a list
of vectors
, convert it to numeric
, get the median
and divide by 1000
sapply(strsplit(sub("\\[.*$", "", df1$col1), "\\s*–\\s*|\\s+"),
function(x) median(as.numeric(x)))/1000
df1 <- structure(list(col1 = c("1800[10]", "3300 – 12 000[11]", "5000 – 6000[12]",
"5930 – 6970[13]", "75 000 – 80 000[14]", "93 000[15]", "195 000 – 210 000[16]",
"530 000[17]", "595 000[18]")), class = "data.frame", row.names = c(NA,
-9L))
Upvotes: 2