how can I harmonized columns in a data frame with string and number (getting only numbers)?

Question

I'm trying to harmonized my df in R. I have a column that has been populated using different formats eg:

    ID       value
    1   first value in 5' = 9 || second value in 20' = 8
    2   first value in 5' = 6 || second value in 20' = 12
    3   5,6
    4   7,8
    5   9  10

what I'm trying so far is

     library(stringr)
     df <- read.table("data", sep="	", header = T)
     value2 <- df$value %>% str_match_all("[0-9]+") %>% unlist %>% as.numeric

I expect this output

    ID       value
    1        9,8
    2        6,12
    3        5,6
    4        7,8
    5        9,10

but obviously, the actual output is an string [1] 9 8 6 12 5 6 7 8 9 10

based on your experience, any ideas? thanks!

Sotos · Accepted Answer

One way is to split on || and use gsub to extract the values after =. We can then use toString() to aggregate them, i.e.

library(tidyverse)

dd %>% 
 separate_rows(value, sep = ' \|\| ') %>% 
 mutate(value = gsub('.*=', '', value)) %>% 
 group_by(ID) %>% 
 summarise(value = toString(value))

which gives,

# A tibble: 5 x 2
     ID value    
       
1     1 " 9,  8" 
2     2 " 6,  12"
3     3 5,6      
4     4 7,8      
5     5 9 10

EDIT: To also convert the last space to a comma, we can add another mutate to the pipeline, i.e.

dd %>% 
     separate_rows(value, sep = ' \|\| ') %>% 
     mutate(value = gsub('.*=', '', value)) %>% 
     group_by(ID) %>% 
     summarise(value = toString(value)) %>%
     mutate(value = gsub(' ', ',', value))

how can I harmonized columns in a data frame with string and number (getting only numbers)?

Answers (1)

Related Questions