Reputation: 173
I have the following string structure as a column value in my data frame :
Y: 10 ,W: 3 , cp: 0.05
the numeric values at each row differ but the structure remains the same. I want to split this string into 3 columns, each containing only the numbers. So there will be one column for Y with the corresponding numeric value, another for W and the last for cp.
I have tried using str_split in the following way:
str_split(string,pattern = " ,",simplify = TRUE )
which obviously gives me:
[,1] [,2] [,3]
[1,] "Y: 40 " "W: 2" " cp: 0.05"
Now, I want to keep only the numbers in each of those columns. Still learning this stuff so not sure how to proceed! Any help is highly appreciated!
Upvotes: 1
Views: 764
Reputation: 3
I believe this should work:
library(tidyverse)
string <- c("Y: 10 ,W: 3 , cp: 0.05","Y: 4 ,W: 9 , cp: 2.2")
dat <- tibble(x = string) %>%
separate(x,c("Y","W","cp"), sep = " ,")
dat2 <- dat %>% mutate_all(., ~str_remove(.,"\\D+"))
Upvotes: 0
Reputation: 39647
You can remove all unneeded characters e.g. with gsub
and then use strsplit
or read.csv
.
In base it would look like:
string <- c("Y: 10 ,W: 3 , cp: 0.05", "Y: 10 ,W: 3 , cp: 0.05")
read.csv(text=gsub("[[:alpha:]: ]", "", string), header=FALSE)
# V1 V2 V3
#1 10 3 0.05
#2 10 3 0.05
#or with strsplit
strsplit(gsub("[[:alpha:]: ]", "", string), ",")
Upvotes: 1
Reputation: 349
Given that your text strings are uniform it should be relatively simple to do, the first part would look like this:
txt <- c(
"Y: 10 ,W: 3 , cp: 0.05",
"Y: 6 ,W: 7 , cp: 0.08",
"Y: 5 ,W: 0 , cp: 0.08"
)
x <- do.call(rbind, strsplit(txt, split = " ,"))
And that would get a matrix of your "label: value"
library(stringr)
y <- matrix(data = str_extract(string = x,
pattern = "([0-9.]+)"),
ncol = ncol(x))
Will get you to text strings that signify your values, if you want, you can just use str_extract() without the matrix call to get your values as a vector, and:
z <- matrix(data = as.numeric(y),
ncol = ncol(x))
will get you your matrix as numerics, which it sounds like is what you're interested in.
all together it's fairly tidy, and without the intermediate matrix call, if don't need that, it would look like:
library(stringr)
txt <- c(
"Y: 10 ,W: 3 , cp: 0.05",
"Y: 6 ,W: 7 , cp: 0.08",
"Y: 5 ,W: 0 , cp: 0.08"
)
x <- do.call(rbind, strsplit(txt, split = " ,"))
y <- str_extract(string = x,
pattern = "([0-9.]+)")
z <- matrix(data = as.numeric(y),
ncol = ncol(x))
With z giving you a matrix of numerics.
Upvotes: 0
Reputation: 93803
I find sometimes reformatting name: value
pair data back to an existing structure helps to take care of complexity. In this case, I've formatted to a JSON object, and then used stream_in
from jsonlite to deal with the data.
This is nice because it will automatically name the columns, and also takes care of occasions when not every value is represented in every row, or the order changes. E.g.:
txt <- c(
"Y: 10 ,W: 3 , cp: 0.05",
"Y: 6 ,W: 7 , cp: 0.08",
"cp: 0.08, Y: 6 "
)
library(jsonlite)
proctxt <- paste("{", gsub("([A-Za-z]+?):", '"\\1":', txt), "}")
stream_in(textConnection(proctxt))
# Found 3 records...
# Imported 3 records. Simplifying...
# Y W cp
#1 10 3 0.05
#2 6 7 0.08
#3 6 NA 0.08
Upvotes: 1
Reputation: 18541
There are definitely nicer ways, but this should do the job:
Now updated for string vector with more than one element and bringing it into a matrix with three named columns. Should work on vectors of any length.
library(stringr)
string <- c("Y: 10 ,W: 3 , cp: 0.05","Y: 4 ,W: 9 , cp: 2.2")
vec <- t(str_split(str_split(string, " ,", simplify = TRUE), ": ", simplify = TRUE)[,2])
mtx = matrix(
vec,
nrow = length(vec)/3,
ncol = 3)
colnames(mtx) <- c("Y","W","cp")
mtx
Upvotes: 4
Reputation: 940
Maybe not the most elegant way but it works:
library(dplyr)
library(stringr)
library(tidyr)
tibble(row = c(1,2),
col = c("Y: 10 ,W: 3 , cp: 0.05","Y: 4 ,W: 9 , cp: 2.2")) %>%
separate(col, into=c("col1", "col2", "col3"), sep = ",") %>%
gather(id, col, -row) %>%
select(-id) %>%
mutate(col = str_trim(col)) %>%
separate(col, into=c("letter", "number"), sep=":") %>%
mutate(number = str_trim(number)) %>%
spread(letter, number) %>%
select(-row)
# A tibble: 2 x 3
cp W Y
<chr> <chr> <chr>
1 0.05 3 10
2 2.2 9 4
Note that I had to add a new column named row
to your data frame to make this approach work
Upvotes: 1