Reputation: 385
I have a data frame that I want to split the first column's text string into two columns but only after the second whitespace in the sequence. Here is a sample:
test22 Ticker
1 Current SharePrice $6.57 MFM
2 Current NAV $7.11 MFM
3 Current Premium/Discount -7.59% MFM
4 52WkAvg SharePrice $6.55 MFM
5 52WkAvg NAV $7.21 MFM
6 52WkAvg Premium/Discount -9.19% MFM
Essentially, if the end result would be a data frame with three columns total and the price/% field being its own separate column. Thanks!
Upvotes: 0
Views: 951
Reputation: 714
Here's an option using dplyr
and stringr
:
library(dplyr)
library(stringr)
data <-
tibble(test22 = c("Current SharePrice $6.57",
"Current NAV $7.11",
"Current Premium/Discount -7.59%",
"52WkAvg SharePrice $6.55",
"52WkAvg NAV $7.21",
"52WkAvg Premium/Discount -9.19%"),
Ticker = "MFM")
data %>%
mutate(category = str_replace(test22, "^(.+ .+) (.+)$", "\\1"),
price_pc = str_replace(test22, "^(.+ .+) (.+)$", "\\2"))
# A tibble: 6 x 4
test22 Ticker category price_pc
<chr> <chr> <chr> <chr>
1 Current SharePrice $6.57 MFM Current SharePrice $6.57
2 Current NAV $7.11 MFM Current NAV $7.11
3 Current Premium/Discount -7.59% MFM Current Premium/Discount -7.59%
4 52WkAvg SharePrice $6.55 MFM 52WkAvg SharePrice $6.55
5 52WkAvg NAV $7.21 MFM 52WkAvg NAV $7.21
6 52WkAvg Premium/Discount -9.19% MFM 52WkAvg Premium/Discount -9.19%
EDIT: Explanation of the Regex used
Ignoring the brackets for a second:
^ = start of the string
. = any character except a new line
+ = at least one of the previous character (in this case, any character except a new line)
$ = end of the string
So "^(.+ .+) (.+)$"
looks for strings which start, have some characters, then a space, then some characters, then a space, then some more characters, then end.
The brackets are added as "capture groups", meaning that the query "remembers" the part of the string which is represented by these brackets, and can be extracted by referring to the order of the brackets. Hence "\\1"
returns what was captured by the first bracket, and "\\2"
returns what was captured by the second.
A good resource for learning Regex is Regexr.
Upvotes: 1
Reputation: 887128
One option in base r is to create a delimiter ,
with sub
and then use read.csv
:
out <- cbind(read.csv(text = sub(" (\\S+)$", ",\\1", df1$test22),
header = FALSE, stringsAsFactors = FALSE), df1[2])
out
#. V1 V2 Ticker
#1 Current SharePrice $6.57 MFM
#2 Current NAV $7.11 MFM
#3 Current Premium/Discount -7.59% MFM
#4 52WkAvg SharePrice $6.55 MFM
#5 52WkAvg NAV $7.21 MFM
#6 52WkAvg Premium/Discount -9.19% MFM
Or using extract
from tidyr
library(tidyverse)
df1 %>%
extract(test22, into = c("V1", "V2"), "^(\\S+\\s+\\S+)\\s+(.*)")
# V1 V2 Ticker
#1 Current SharePrice $6.57 MFM
#2 Current NAV $7.11 MFM
#3 Current Premium/Discount -7.59% MFM
#4 52WkAvg SharePrice $6.55 MFM
#5 52WkAvg NAV $7.21 MFM
#6 52WkAvg Premium/Discount -9.19% MFM
df1 <- structure(list(test22 = c("Current SharePrice $6.57", "Current NAV $7.11",
"Current Premium/Discount -7.59%", "52WkAvg SharePrice $6.55",
"52WkAvg NAV $7.21", "52WkAvg Premium/Discount -9.19%"), Ticker = c("MFM",
"MFM", "MFM", "MFM", "MFM", "MFM")), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6"))
Upvotes: 1
Reputation: 5281
Here is an option using strsplit
data.frame(do.call(rbind, strsplit(df$test22, '\\s(?!.*\\s)', perl = TRUE)),
Ticker=df$Ticker)
# X1 X2 Ticker
# 1 Current SharePrice $6.57 MFM
# 2 Current NAV $7.11 MFM
# 3 Current Premium/Discount -7.59% MFM
# 4 52WkAvg SharePrice $6.55 MFM
# 5 52WkAvg NAV $7.21 MFM
# 6 52WkAvg Premium/Discount -9.19% MFM
Or using gsub
gsub('.*\\s.*?\\s(.*)','\\1', df$test22, perl = TRUE)
# [1] "$6.57" "$7.11" "-7.59%" "$6.55" "$7.21" "-9.19%"
# or if factors
# gsub('.*\\s.*?\\s(.*)','\\1', as.character(df$test22), perl = TRUE)
The advantage of the second is that it truly considers the second spacing character (as opposed to last spacing).
Upvotes: 0