js80
js80

Reputation: 385

Conditional Data Frame String Split Based on Second White Space

I have a data frame that I want to split the first column's text string into two columns but only after the second whitespace in the sequence. Here is a sample:

                           test22 Ticker
1        Current SharePrice $6.57    MFM
2               Current NAV $7.11    MFM
3 Current Premium/Discount -7.59%    MFM
4        52WkAvg SharePrice $6.55    MFM
5               52WkAvg NAV $7.21    MFM
6 52WkAvg Premium/Discount -9.19%    MFM

Essentially, if the end result would be a data frame with three columns total and the price/% field being its own separate column. Thanks!

Upvotes: 0

Views: 951

Answers (3)

g_t_m
g_t_m

Reputation: 714

Here's an option using dplyr and stringr:

library(dplyr)
library(stringr)

data <-
  tibble(test22 = c("Current SharePrice $6.57",
                    "Current NAV $7.11",
                    "Current Premium/Discount -7.59%",
                    "52WkAvg SharePrice $6.55",
                    "52WkAvg NAV $7.21",
                    "52WkAvg Premium/Discount -9.19%"),
         Ticker = "MFM")

data %>% 
  mutate(category = str_replace(test22, "^(.+ .+) (.+)$", "\\1"),
         price_pc = str_replace(test22, "^(.+ .+) (.+)$", "\\2"))


# A tibble: 6 x 4
test22                          Ticker category                 price_pc
<chr>                           <chr>  <chr>                    <chr>   
1 Current SharePrice $6.57        MFM    Current SharePrice       $6.57   
2 Current NAV $7.11               MFM    Current NAV              $7.11   
3 Current Premium/Discount -7.59% MFM    Current Premium/Discount -7.59%  
4 52WkAvg SharePrice $6.55        MFM    52WkAvg SharePrice       $6.55   
5 52WkAvg NAV $7.21               MFM    52WkAvg NAV              $7.21   
6 52WkAvg Premium/Discount -9.19% MFM    52WkAvg Premium/Discount -9.19% 

EDIT: Explanation of the Regex used

Ignoring the brackets for a second:

^ = start of the string

. = any character except a new line

+ = at least one of the previous character (in this case, any character except a new line)

$ = end of the string

So "^(.+ .+) (.+)$" looks for strings which start, have some characters, then a space, then some characters, then a space, then some more characters, then end.

The brackets are added as "capture groups", meaning that the query "remembers" the part of the string which is represented by these brackets, and can be extracted by referring to the order of the brackets. Hence "\\1" returns what was captured by the first bracket, and "\\2" returns what was captured by the second.

A good resource for learning Regex is Regexr.

Upvotes: 1

akrun
akrun

Reputation: 887128

One option in base is to create a delimiter , with sub and then use read.csv:

out <- cbind(read.csv(text = sub(" (\\S+)$", ",\\1", df1$test22), 
       header = FALSE, stringsAsFactors = FALSE), df1[2])
out
#.                       V1     V2 Ticker
#1       Current SharePrice  $6.57    MFM
#2              Current NAV  $7.11    MFM
#3 Current Premium/Discount -7.59%    MFM
#4       52WkAvg SharePrice  $6.55    MFM
#5              52WkAvg NAV  $7.21    MFM
#6 52WkAvg Premium/Discount -9.19%    MFM

Or using extract from tidyr

library(tidyverse)
df1 %>% 
     extract(test22, into = c("V1", "V2"), "^(\\S+\\s+\\S+)\\s+(.*)")
#                        V1     V2 Ticker
#1       Current SharePrice  $6.57    MFM
#2              Current NAV  $7.11    MFM
#3 Current Premium/Discount -7.59%    MFM
#4       52WkAvg SharePrice  $6.55    MFM
#5              52WkAvg NAV  $7.21    MFM
#6 52WkAvg Premium/Discount -9.19%    MFM

data

df1 <- structure(list(test22 = c("Current SharePrice $6.57", "Current NAV $7.11", 
  "Current Premium/Discount -7.59%", "52WkAvg SharePrice $6.55", 
 "52WkAvg NAV $7.21", "52WkAvg Premium/Discount -9.19%"), Ticker = c("MFM", 
 "MFM", "MFM", "MFM", "MFM", "MFM")), class = "data.frame", row.names = c("1", 
  "2", "3", "4", "5", "6"))

Upvotes: 1

niko
niko

Reputation: 5281

Here is an option using strsplit

data.frame(do.call(rbind, strsplit(df$test22, '\\s(?!.*\\s)', perl = TRUE)), 
           Ticker=df$Ticker)
#                         X1     X2 Ticker
# 1       Current SharePrice  $6.57    MFM
# 2              Current NAV  $7.11    MFM
# 3 Current Premium/Discount -7.59%    MFM
# 4       52WkAvg SharePrice  $6.55    MFM
# 5              52WkAvg NAV  $7.21    MFM
# 6 52WkAvg Premium/Discount -9.19%    MFM

Or using gsub

gsub('.*\\s.*?\\s(.*)','\\1', df$test22, perl = TRUE)
# [1] "$6.57"  "$7.11"  "-7.59%" "$6.55"  "$7.21"  "-9.19%"
# or if factors
# gsub('.*\\s.*?\\s(.*)','\\1', as.character(df$test22), perl = TRUE)

The advantage of the second is that it truly considers the second spacing character (as opposed to last spacing).

Upvotes: 0

Related Questions