Gotmadstacks

Reputation: 369

R - Splitting strings in a column on a character and keeping specific results

This link gets me 90% of the way to what I want to figure out: R Split String By Delimiter in a column

Here's the example input:

A               B       C    
awer.ttp.net    Code    554
abcd.ttp.net    Code    747
asdf.ttp.net    Part    554
xyz.ttp.net     Part    747

And the result produced by the approach from that link:

library(dplyr)
df = df %>% mutate(D=gsub("\\..*","",A))

A    B   C    D
awer.ttp.net Code 554 awer
abcd.ttp.net Code 747 abcd
asdf.ttp.net Part 554 asdf
xyz.ttp.net Part 747  xyz

But this only gives you the string before the first dot. What if you want the following?

A    B   C    D
awer.ttp.net Code 554 ttp
abcd.ttp.net Code 747 ttp
asdf.ttp.net Part 554 ttp
xyz.ttp.net Part 747  ttp

Upvotes: 1

Views: 1386

Answers (2)

dwcoder

Reputation: 488

You can use the strsplit function for this, and wrap it in a function that returns the part you want.

Make your dataframe

temp <- "A               B       C
awer.ttp.net    Code    554
abcd.ttp.net    Code    747
asdf.ttp.net    Part    554
xyz.ttp.net     Part    747
"
df <- read.table(textConnection(temp), header=TRUE, as.is=TRUE )

We want to use the strsplit function, which splits a string at a given pattern and returns a list with one character vector of pieces per input string. For instance:

strsplit("A-B-C-D", "-")
#[[1]]
#[1] "A" "B" "C" "D"
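One detail matters once the delimiter is a dot: by default strsplit treats its second argument as a regular expression, in which an unescaped "." matches any character. The sketch below uses one value from the question's column A to compare the unescaped pattern with two ways of splitting on a literal dot. The variable names are only illustrative.

```r
x <- "awer.ttp.net"

# Unescaped: "." is a regex wildcard, so every character acts as a delimiter
# and only empty strings are left over.
unescaped <- strsplit(x, ".")[[1]]

# Escaped: "\\." matches a literal dot.
escaped <- strsplit(x, "\\.")[[1]]

# fixed = TRUE turns regex interpretation off, so "." is taken literally.
literal <- strsplit(x, ".", fixed = TRUE)[[1]]

print(escaped)  # "awer" "ttp" "net"
print(literal)  # "awer" "ttp" "net"
```

This is why the pattern passed to the helper further down is written as '\\.' rather than '.'.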

Wrap this in a function that returns only the specified part:

mystrsplit <- function(x, pattern, part=2){
  return(strsplit(x, pattern)[[1]][part])
}
# Vectorize it so that it can handle vector arguments of x
mystrsplit <- Vectorize(mystrsplit, vectorize.args = "x")

Use our mystrsplit function in mutate:

library(dplyr)
df %>% mutate(D=mystrsplit(A, '\\.', 2))

#             A    B   C   D
#1 awer.ttp.net Code 554 ttp
#2 abcd.ttp.net Code 747 ttp
#3 asdf.ttp.net Part 554 ttp
#4  xyz.ttp.net Part 747 ttp
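As a possible variation, strsplit already accepts a whole character vector, so the same idea can be sketched without Vectorize by splitting the column once and picking the requested piece from each result with vapply. The helper name nth_part is hypothetical, and the data frame is rebuilt inline from the question's example so the snippet runs on its own.

```r
# Example data from the question, rebuilt inline.
df <- data.frame(
  A = c("awer.ttp.net", "abcd.ttp.net", "asdf.ttp.net", "xyz.ttp.net"),
  B = c("Code", "Code", "Part", "Part"),
  C = c(554, 747, 554, 747),
  stringsAsFactors = FALSE
)

# nth_part is a hypothetical helper, not part of any package.
nth_part <- function(x, pattern, part = 2) {
  pieces <- strsplit(x, pattern)  # list with one character vector per element of x
  # Indexing past the end of a vector gives NA, so strings with too few pieces yield NA.
  vapply(pieces, function(p) p[part], character(1))
}

df$D <- nth_part(df$A, "\\.", 2)
print(df)
```

The function can be used inside mutate in the same way as mystrsplit, for example mutate(D = nth_part(A, '\\.', 2)).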

Upvotes: 0

akrun

Reputation: 887511

We can capture the part we want as a group. From the beginning of the string (^), match one or more characters that are not a dot ([^.]+), then a literal dot (\\.), then a second run of non-dot characters captured as a group (([^.]+)), then another literal dot and the rest of the string (\\..*). The whole match is replaced with the backreference (\\1) to the captured group. Here df1 stands for the question's data frame.

library(dplyr)
df1 %>%
    mutate(D= sub("^[^.]+\\.([^.]+)\\..*", "\\1", A))
#             A    B   C   D
#1 awer.ttp.net Code 554 ttp
#2 abcd.ttp.net Code 747 ttp
#3 asdf.ttp.net Part 554 ttp
#4  xyz.ttp.net Part 747 ttp

Or use extract from tidyr, which puts the captured group into a new column:

library(tidyr)
df1 %>% 
   extract(A, into = 'D', "^[^.]+\\.([^.]+).*", remove = FALSE)

Note that dplyr is not needed for this; base R alone is enough:

df1$D <- sub("^[^.]+\\.([^.]+)\\..*", "\\1", df1$A)
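One thing worth checking with the sub approach is what happens when a value does not contain two dots: sub returns its input unchanged when the pattern does not match. The sketch below shows this on a few made-up host names (they are assumptions, not data from the question) and one possible way to return NA for those cases instead.

```r
pattern <- "^[^.]+\\.([^.]+)\\..*"

# Illustrative inputs only; the last two have fewer than two dots.
hosts <- c("awer.ttp.net", "a.b.c.d", "ttp.net", "localhost")

# Non-matching strings come back unchanged.
second_part <- sub(pattern, "\\1", hosts)
print(second_part)  # "ttp" "b" "ttp.net" "localhost"

# Possible guard: keep the replacement only where the pattern matched.
second_or_na <- ifelse(grepl(pattern, hosts),
                       sub(pattern, "\\1", hosts),
                       NA_character_)
print(second_or_na)  # "ttp" "b" NA NA
```

Whether unchanged input or NA is the better fallback depends on how the new column is used afterwards.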

Upvotes: 1
