How to remove first few characters from R column values?

Question

I have a column whose values are separated by a "|". I wrote the code below, but it keeps everything before the "|", not after. Keep in mind that this column is a factor.

INV | Building One
BO | Building Twenty Five
VC | Corporate

sub("([A-Za-z]+).*", "\\1", df$col1)

How do I remove the first portion before the "|" and keep only everything after in R using 'sub'?

Expected Output:

Building One
Building Twenty Five
Corporate

JBGruber · Accepted Answer

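The regular expression you are looking for is ".*?\\|" (written as an R string, so the backslash is doubled).

. matches any character
* repeats the preceding element zero or more times
? makes * 'lazy', so the match stops at the first "|" instead of the last
\\| matches a literal "|"; "|" is a regex metacharacter (alternation), so it must be escaped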
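To see what the lazy "?" changes, here is a small sketch on a made-up string that contains two "|" characters (the string and the variable names are only for illustration, they are not from the question):

```r
# Hypothetical input with more than one "|" to make the difference visible.
x <- "a | b | c"

# Lazy: ".*?" stops at the first "|", so only "a |" is removed.
lazy <- sub(".*?\\|", "", x)

# Greedy: ".*" runs up to the last "|", so "a | b |" is removed.
greedy <- sub(".*\\|", "", x)

print(lazy)
print(greedy)
```

With a single "|" per value, as in the question, both patterns should give the same result.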

Test:

df <- data.frame(col1 = c("INV | Building One", 
                          "BO | Building Twenty Five",
                          "VC | Corporate"))

sub(".*?\\|", "", df$col1)
#> [1] " Building One"         " Building Twenty Five" " Corporate"
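Note that the result above still carries a leading space, while the expected output in the question has none. A minimal sketch that also drops the whitespace after the "|", assuming the column is a factor as the question says (the variable names here are illustrative):

```r
# Same sample values as in the test, stored as a factor as in the question.
col1 <- factor(c("INV | Building One",
                 "BO | Building Twenty Five",
                 "VC | Corporate"))

# "^[^|]*" matches everything up to the first "|", "\\|" matches the bar itself,
# and "\\s*" swallows any whitespace that follows it.
# as.character() makes the factor-to-character conversion explicit.
cleaned <- sub("^[^|]*\\|\\s*", "", as.character(col1))

print(cleaned)
```

Wrapping the original call in trimws() should work just as well if you prefer to keep the simpler pattern.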

Here is a brilliant regex cheatsheet I use for this kind of stuff: https://rstudio.com/wp-content/uploads/2016/09/RegExCheatsheet.pdf

BTW: tidyr comes with a nice little function that would help here:

library(tidyr)
df %>% 
  separate(col1, into = c("col1", "col2"), sep = "\\|")
#>   col1                  col2
#> 1 INV           Building One
#> 2  BO   Building Twenty Five
#> 3  VC              Corporate

It splits your one column into two, which seems to suit the data here.
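If tidyr is not available, a similar split can be sketched in base R with strsplit(). This is only an approximation of what separate() does, and the column names "code" and "name" are made up for the example. The separator pattern also consumes the spaces around the "|", so both parts come out trimmed:

```r
# Same sample values as above.
x <- c("INV | Building One",
       "BO | Building Twenty Five",
       "VC | Corporate")

# Split on "|" together with any surrounding whitespace.
# Each element of the resulting list has two pieces: the prefix and the rest.
pieces <- strsplit(x, "\\s*\\|\\s*")

# Stack the pieces row-wise into a two-column character matrix.
parts <- do.call(rbind, pieces)

# Hypothetical column names, chosen only for this sketch.
parts_df <- data.frame(code = parts[, 1],
                       name = parts[, 2],
                       stringsAsFactors = FALSE)

print(parts_df)
```

This assumes every value contains exactly one "|"; rows with a different number of separators would need extra handling.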
