Nick Knauer
Nick Knauer

Reputation: 4243

Parse By nth delimitor in R

I have a dataframe like below:

Col1    Col2
   A    5!5!!6!!3!!m
   B    7_8!!6!!7!!t

structure(list(Col1 = c("A", "B"), Col2 = c("5!5!!6!!3!!m", "7_8!!6!!7!!t" )), class = "data.frame", row.names = c(NA, -2L))

How do I create a new column that extracts the 3rd parse of the strings found in Col2?

In SQL I am using SPLIT_PART function:

SPLIT_PART(Col2, '!!', 3)

I am looking to find an equivalent function in R.

Expected output:

Col1            Col2    Col3
   A    5!5!!6!!3!!m       3
   B    7_8!!6!!7!!t       7

Upvotes: 3

Views: 47

Answers (3)

akrun
akrun

Reputation: 887148

We can use str_extract to extract the digits

library(stringr)
df1 %>%
  mutate(Col3 = as.numeric(str_extract(Col2, "\\d+(?=!![a-z]+$)")))
#  Col1         Col2 Col3
#1    A 5!5!!6!!3!!m    3
#2    B 7_8!!6!!7!!t    7

If we need it by position, then

df1$Col3 <- as.numeric(sapply(strsplit(df1$Col2, "!!", fixed = TRUE), `[`, 3))
df1$Col3
#[1] 3 7

Or using gsubfn to create a position identifier and then extract the numbers before it

library(gsubfn)
p <- proto(fun = function(this, x)  if(count == 3) paste0(";", x))
as.numeric(str_extract(gsubfn("(!!)", p, df1$Col2), "\\d+(?=;)"))
#[1] 3 7

data

df1 <- structure(list(Col1 = c("A", "B"), Col2 = c("5!5!!6!!3!!m", "7_8!!6!!7!!t"
 )), class = "data.frame", row.names = c(NA, -2L))

Upvotes: 1

divibisan
divibisan

Reputation: 12155

Here's a tidyverse option, though the core is funcitonally identical to Rushabh's data.table based answer.

When given the simplify=T argument, stringr::str_split will output a matrix, with each match in a column. You can subset the desired column from it to extract the desired position:

library(tidyverse)

df1 %>%
    mutate(Col3 = str_split(Col2, pattern = '!!', simplify=T)[,3])

  Col1         Col2 Col3
1    A 5!5!!6!!3!!m  5!5
2    B 7_8!!6!!7!!t  7_8

df1 %>%
    mutate(Col3 = str_split(Col2, pattern = '!!', simplify=T)[,2])

  Col1         Col2 Col3
1    A 5!5!!6!!3!!m    6
2    B 7_8!!6!!7!!t    6

df1 %>%
  mutate(Col3 = str_split(Col2, pattern = '!!', simplify=T)[,1])

  Col1         Col2 Col3
1    A 5!5!!6!!3!!m  5!5
2    B 7_8!!6!!7!!t  7_8

Upvotes: 1

Rushabh Patel
Rushabh Patel

Reputation: 2764

You can use str_split from stringr package-

> library(stringr)
> library(data.table)
> setDT(dt)[,Col3:=sapply(Col2,function(x) unlist(str_split(x,"!!"))[3])]

Output-

> dt
    Col1      Col2        Col3
1:    A   5!5!!6!!3!!m      3
2:    B   7_8!!6!!7!!t      7

Note- You can change position from 3rd to nth in function.

Upvotes: 1

Related Questions