Reputation: 299
I'm trying to extract all numeric values from a string column in R that contains both numeric and non-numeric characters. My goal is to keep the original order by replacing every run of non-numeric characters with a comma.
My example data:
name <- c("./Stimuli\49stim_9_with_14_vs_23_mix2.png", "./Stimuli\54stim_14_with_15_vs_21_mix2.png", "./Stimuli\75stim_15_with_18_vs_26_incongruent.png")
My expected outcome:
expectedpoutcome <- c("49, 9, 14, 23, 2", "54, 14, 15, 21, 2", "75, 15, 18, 26")
The closest I could get:
library(stringr)
myoutcome <- name %>% str_match_all("[0-9]+") %>% unlist %>% as.numeric
The problem with this result is that the information about which original string each number came from is lost.
Upvotes: 1
Views: 941
Reputation: 388817
Using base R, we can extract all the numeric values using gregexpr and regmatches, and change them to a comma-separated string using toString.
sapply(regmatches(name, gregexpr("[0-9]+", name)), toString)
#[1] "49, 9, 14, 23, 2" "14, 15, 21, 2" "75, 15, 18, 26"
Upvotes: 1
Reputation: 545518
Your regular expression is correct. The issue, rather, is the code that comes after it: you are flattening the list (and thus losing the correspondence between numbers and original string), and then you’re converting the output to numbers, even though you indicated that you want to obtain a string.
So, start by removing the %>% unlist %>% as.numeric steps.
Next, there’s a neat trick to merge a list of strings into a single, comma-separated string: toString. So apply that over your list of results:
name %>% stringr::str_match_all("[0-9]+") %>% sapply(toString)
And there we have it.
Additionally, you can simplify the regular expression: for ASCII digits, \d is equivalent to [0-9], giving us:
name %>% stringr::str_match_all("\\d+") %>% sapply(toString)
And, lastly, your “expected outcome” is plain wrong, because you’re misinterpreting the meaning of backslash escape sequences in the string. Read the documentation on string escape sequences.
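To illustrate the escape-sequence point, here is a minimal base R sketch. It assumes the file names are meant to contain a literal backslash, so each backslash is doubled in the string literal; name_fixed and result are names introduced here for illustration only.

```r
# Assumption: the intended file names contain a literal backslash,
# so every backslash is doubled inside the string literal.
name_fixed <- c(
  "./Stimuli\\49stim_9_with_14_vs_23_mix2.png",
  "./Stimuli\\54stim_14_with_15_vs_21_mix2.png",
  "./Stimuli\\75stim_15_with_18_vs_26_incongruent.png"
)

# Extract every run of digits per string, then collapse each
# per-string result into one comma-separated string.
result <- sapply(
  regmatches(name_fixed, gregexpr("[0-9]+", name_fixed)),
  toString
)
print(result)
# [1] "49, 9, 14, 23, 2"  "54, 14, 15, 21, 2" "75, 15, 18, 26"
```

With the backslashes escaped, the result matches the expected outcome from the question. In R 4.0.0 or later, a raw string such as r"(./Stimuli\49stim_9.png)" should avoid the doubling altogether.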
Alternatively, instead of matching all digits you can do the opposite: match everything that’s not a digit, and replace such runs by ', '. However, you’ll then need to remove the leading and trailing commas:
trimws(gsub('\\D+', ', ', name), whitespace = ', ')
Upvotes: 3
Reputation: 10776
A tidyverse solution (using stringr and purrr) is this:
name %>% str_extract_all("\\d+") %>% map_chr(paste, collapse = ", ")
This does not produce the output you showed. The reason is that the backslashes in your strings start escape sequences, so some of the digits you typed never end up in the string as digits at all.
Once the string has been parsed, R cannot tell which characters were typed as is and which came from an escape sequence.
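A quick way to see this for yourself is to look at the code points R actually stores. This is only a sketch in base R; second and digits are names made up for the example.

```r
# "\54" and "\75" are octal escapes, so each literal holds a single
# character rather than a backslash followed by two digits.
utf8ToInt("\54")  # 44, the code point of ","
utf8ToInt("\75")  # 61, the code point of "="

# The second file name from the question therefore contains no "54".
second <- "./Stimuli\54stim_14_with_15_vs_21_mix2.png"
print(second)

# Extracting the digit runs from it yields only the remaining numbers.
digits <- toString(regmatches(second, gregexpr("[0-9]+", second))[[1]])
print(digits)
# [1] "14, 15, 21, 2"
```

So by the time any regular expression runs, the 54 is already gone; the fix has to happen where the string is written or read in, not in the extraction step.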
Upvotes: 1