Reputation: 155
I'm trying to harmonized my df in R. I have a column that has been populated using different formats eg:
ID value
1 first value in 5' = 9 || second value in 20' = 8
2 first value in 5' = 6 || second value in 20' = 12
3 5,6
4 7,8
5 9 10
what I'm trying so far is
library(stringr)
df <- read.table("data", sep="\t", header = T)
value2 <- df$value %>% str_match_all("[0-9]+") %>% unlist %>% as.numeric
I expect this output
ID value
1 9,8
2 6,12
3 5,6
4 7,8
5 9,10
but obviously, the actual output is an string [1] 9 8 6 12 5 6 7 8 9 10
based on your experience, any ideas? thanks!
Upvotes: 0
Views: 91
Reputation: 51592
One way is to split on ||
and use gsub
to extract the values after =
. We can then use toString()
to aggregate them, i.e.
library(tidyverse)
dd %>%
separate_rows(value, sep = ' \\|\\| ') %>%
mutate(value = gsub('.*=', '', value)) %>%
group_by(ID) %>%
summarise(value = toString(value))
which gives,
# A tibble: 5 x 2 ID value <int> <chr> 1 1 " 9, 8" 2 2 " 6, 12" 3 3 5,6 4 4 7,8 5 5 9 10
EDIT: To also convert the last space to a comma, we can add another mutate
to the pipeline, i.e.
dd %>%
separate_rows(value, sep = ' \\|\\| ') %>%
mutate(value = gsub('.*=', '', value)) %>%
group_by(ID) %>%
summarise(value = toString(value)) %>%
mutate(value = gsub(' ', ',', value))
Upvotes: 1