user12928769
user12928769

Reputation:

How to remove footnote references from a column in R?

probably a pretty basic question. I'm trying to analyse a data set that includes a column with footnote references in square brackets. I hope to remove these but I cannot seem to find the appropriate function/pattern to remove both them and the numbers inside the brackets.

Likewise, I would like take the median of those entries which have two numbers, e.g. "5930 – 6970[13]" and thereafter divide all entries by 1000.

              "1800[10]"         "3300 – 12 000[11]"       "5000 – 6000[12]"      
       "5930 – 6970[13]"       "75 000 – 80 000[14]"            "93 000[15]"           
 "195 000 – 210 000[16]"               "530 000[17]"           "595 000[18]"      

Any help will be highly appreciated.

Upvotes: 0

Views: 357

Answers (3)

Till
Till

Reputation: 6628

A dplyr/tidyverse solution: Similar to the answers by akrun and 42-, I use gsub() to get rid of the brackets, their contents and spaces. Next I tidyr::separate() the values in col1 into two columns. dplyr::rowise() and dplyr::mutate() are then used to calculate the median per row and devide by 1000.

library(dplyr)
library(tidyr)

df1 %>% 
  mutate(col1 = gsub("\\[(.*?)\\]|\\s+", "", col1)) %>% 
  separate(col1, into = c("col1", "col2"), sep = "–", fill = "right") %>% 
  mutate_at(vars(col1, col2), as.numeric) %>% 
  rowwise() %>% 
  mutate(result = median(c(col1, col2) / 1000, na.rm = TRUE))
#> Source: local data frame [9 x 3]
#> Groups: <by row>
#> 
#> # A tibble: 9 x 3
#>     col1   col2 result
#>    <dbl>  <dbl>  <dbl>
#> 1   1800     NA   1.8 
#> 2   3300  12000   7.65
#> 3   5000   6000   5.5 
#> 4   5930   6970   6.45
#> 5  75000  80000  77.5 
#> 6  93000     NA  93   
#> 7 195000 210000 202.  
#> 8 530000     NA 530   
#> 9 595000     NA 595

Alternatively everything can be done with a single mutate().

library(purrr)

  df1 %>% 
    rowwise() %>%
    mutate(col1 = gsub("\\[(.*?)\\]|\\s+", "", col1) %>% 
             strsplit("–") %>% 
             map_dbl(function(x) as.numeric(x) %>% 
                       median())/1000)
  #> Source: local data frame [9 x 1]
  #> Groups: <by row>
  #> 
  #> # A tibble: 9 x 1
  #>     col1
  #>    <dbl>
  #> 1   1.8 
  #> 2   7.65
  #> 3   5.5 
  #> 4   6.45
  #> 5  77.5 
  #> 6  93   
  #> 7 202.  
  #> 8 530   
  #> 9 595

I feel my first example is more explicit, while this one is a bit more obstructed and I usually try to avoid pipes within a mutate().

Upvotes: 0

IRTFM
IRTFM

Reputation: 263382

I interpreted the request a bit differently than @akrun. Thinking that the dashes separated numbers whose thousands-separator was a space I removed both spaces and trailing bracketed numbers using gsub instead of sub. Then the sapply procedure will be given at most two numbers in more comparable ranges:

df1 <- data.frame(col1 = scan(text=' "1800[10]"         "3300 – 12 000[11]"       "5000 – 6000[12]"      
       "5930 – 6970[13]"       "75 000 – 80 000[14]"            "93 000[15]"           
 "195 000 – 210 000[16]"               "530 000[17]"           "595 000[18]" ', what=""))

Processing:

sapply( 
     sapply( strsplit(gsub("[ ]|\\[.*$", "", df1$col1), "–"), # remove spc's and [n]
               as.numeric), 
       median)/1000
[1]   1.80   7.65   5.50   6.45  77.50  93.00 202.50 530.00 595.00

Upvotes: 2

akrun
akrun

Reputation: 887251

We could remove the substring that start with [ using sub, then split the stirng at either the space or space followed by - and any space into a list of vectors, convert it to numeric, get the median and divide by 1000

sapply(strsplit(sub("\\[.*$", "", df1$col1), "\\s*–\\s*|\\s+"), 
         function(x) median(as.numeric(x)))/1000

data

df1 <- structure(list(col1 = c("1800[10]", "3300 – 12 000[11]", "5000 – 6000[12]", 
"5930 – 6970[13]", "75 000 – 80 000[14]", "93 000[15]", "195 000 – 210 000[16]", 
"530 000[17]", "595 000[18]")), class = "data.frame", row.names = c(NA, 
-9L))

Upvotes: 2

Related Questions