psyph

Reputation: 299

Extract numeric values from strings and split them according to original order

I'm trying to extract all numeric values from a string column in R that contains both numeric and non-numeric characters. My goal is to keep the original order by replacing every run of non-numeric characters with a comma.

My example data:

name <- c("./Stimuli\49stim_9_with_14_vs_23_mix2.png", "./Stimuli\54stim_14_with_15_vs_21_mix2.png", "./Stimuli\75stim_15_with_18_vs_26_incongruent.png")

My expected outcome:

expected_outcome <- c("49, 9, 14, 23, 2", "54, 14, 15, 21, 2", "75, 15, 18, 26")

The closest I could get:

library(stringr)

myoutcome <- name %>% str_match_all("[0-9]+") %>% unlist %>% as.numeric

The problem with that result is that the information about which original string each number came from gets lost.

Upvotes: 1

Views: 941

Answers (3)

Ronak Shah

Reputation: 388817

Using base R, we can extract all the numeric values with gregexpr and regmatches, then collapse each set of matches into a comma-separated string with toString.

sapply(regmatches(name, gregexpr("[0-9]+", name)), toString)
#[1] "49, 9, 14, 23, 2" "14, 15, 21, 2"    "75, 15, 18, 26"

Upvotes: 1

Konrad Rudolph

Reputation: 545518

Your regular expression is correct. The issue, rather, is the code that comes after it: you are flattening the list (and thus losing the correspondence between numbers and original string), and then you’re converting the output to numbers, even though you indicated that you want to obtain a string.

So, start by removing the %>% unlist %>% as.numeric steps.
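
To see what the flattening step discards, here is a minimal base-R sketch. It uses regmatches and gregexpr as a stand-in for str_match_all so that it runs without stringr, and it assumes the paths were meant to contain a literal backslash (written as \\). The variable names are made up for illustration.

```r
# Assumed input: literal backslashes written as "\\" (hypothetical variable name)
paths <- c("./Stimuli\\49stim_9_with_14_vs_23_mix2.png",
           "./Stimuli\\54stim_14_with_15_vs_21_mix2.png",
           "./Stimuli\\75stim_15_with_18_vs_26_incongruent.png")

# One list element per input string, so the grouping is still intact here
per_string <- regmatches(paths, gregexpr("[0-9]+", paths))
lengths(per_string)   # number of matches found in each string

# unlist() merges everything into one vector; the per-string boundaries are gone
flattened <- as.numeric(unlist(per_string))
length(flattened)
```

As long as the result stays a list, each element can still be collapsed on its own, which is what the toString step relies on.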

Next, there’s a neat trick to merge a list of strings into a single, comma-separated string: toString. So apply that over your list of results:

name %>% stringr::str_match_all("[0-9]+") %>% sapply(toString)

And there we have it.

Additionally, you can simplify the regular expression: for input like this, \d matches the same characters as [0-9], giving us:

name %>% stringr::str_match_all("\\d+") %>% sapply(toString)

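And, lastly, your “expected outcome” is plain wrong, because you’re misinterpreting the meaning of backslash escape sequences in the string: R parses "\54", for instance, as an octal escape for a single character (a comma), not as a backslash followed by 54. Read the documentation on string escape sequences.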
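
A minimal sketch of the escape-sequence problem, using the second example string. It assumes the intended file name really contains a backslash followed by 54. The helper extract_numbers and the variable names are hypothetical, introduced only for this illustration.

```r
# "\54" is an octal escape: octal 54 is decimal 44, the character ","
broken <- "./Stimuli\54stim_14_with_15_vs_21_mix2.png"

# "\\" is how a single literal backslash is written in an R string
fixed <- "./Stimuli\\54stim_14_with_15_vs_21_mix2.png"

# Hypothetical helper: collect digit runs, one comma-separated string per input
extract_numbers <- function(x) {
  sapply(regmatches(x, gregexpr("[0-9]+", x)), toString)
}

print(broken)             # the "54" is gone, a comma sits in its place
extract_numbers(broken)   # "14, 15, 21, 2"
extract_numbers(fixed)    # "54, 14, 15, 21, 2"
```

Writing the path with forward slashes throughout would avoid the issue as well, if that suits how the file names are used.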

Alternatively, instead of matching all digits you can do the opposite: match everything that’s not a digit, and replace each such run with ', '. You’ll then need to remove the leading and trailing commas:

trimws(gsub('\\D+', ', ', name), whitespace = ', ')
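
The two steps of this replacement approach can be traced separately. This is a sketch on a single string, assuming the path contains a literal backslash (written as \\). The variable names are hypothetical.

```r
# Assumed input with a literal backslash (written "\\"); hypothetical name
path <- "./Stimuli\\75stim_15_with_18_vs_26_incongruent.png"

# Step 1: every run of non-digits becomes ", ", including the runs at both ends
replaced <- gsub("\\D+", ", ", path)

# Step 2: strip the leading and trailing ", "
# (the `whitespace` argument of trimws() needs R >= 3.6.0)
trimmed <- trimws(replaced, whitespace = ", ")

print(replaced)   # ", 75, 15, 18, 26, "
print(trimmed)    # "75, 15, 18, 26"
```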

Upvotes: 3

Robin Gertenbach

Reputation: 10776

A tidyverse solution, using stringr and purrr, is this:

name %>% str_extract_all("\\d+") %>% map_chr(paste, collapse = ", ")

This does not produce the output you showed. The reason is that your strings contain escape sequences, so the characters following each backslash do not end up as digits at all.
Once the string has been parsed, R cannot tell what was typed literally from what was supplied via an escape sequence.

Upvotes: 1
