Dinho

Reputation: 724

How to remove first few characters from R column values?

I have a column whose values contain two parts separated by a "|". I wrote the code below, but it keeps everything before the "|" rather than everything after it. Keep in mind that this column is a factor.

INV | Building One
BO | Building Twenty Five
VC | Corporate

sub("([A-Za-z]+).*", "\\1"

How do I remove the portion before the "|" and keep only what comes after it, using 'sub' in R?

Expected Output:

Building One
Building Twenty Five
Corporate

Upvotes: 2

Views: 409

Answers (2)

ThomasIsCoding

Reputation: 102890

Another approach using sub, capturing everything after the "|" and the whitespace that follows it:

sub(".*\\|\\s+(.*)","\\1",s)

such that

> sub(".*\\|\\s+(.*)","\\1",s)
[1] "Building One"         "Building Twenty Five"
[3] "Corporate"  

Data

s <- c("INV | Building One", "BO | Building Twenty Five", "VC | Corporate")
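
The question says the column is a factor, while `s` above is a character vector. A minimal sketch of the same call on a factor column, assuming a data frame `df` with a column `col1` (both names are hypothetical, not taken from the question):

```r
# Hypothetical data frame; `df` and `col1` are assumed names.
df <- data.frame(
  col1 = factor(c("INV | Building One",
                  "BO | Building Twenty Five",
                  "VC | Corporate"))
)

# Convert the factor to character explicitly, then keep only the
# text after the "|" and the whitespace that follows it.
cleaned <- sub(".*\\|\\s+(.*)", "\\1", as.character(df$col1))

# Turn the result back into a factor only if the column should stay one.
df$col1 <- factor(cleaned)
```

The result of sub is a character vector, so the last line is only needed if later code relies on the column being a factor.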

Upvotes: 3

JBGruber

Reputation: 12478

The regular expression you are looking for is ".*?\\|".

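  • . matches any single character
  • * repeats the preceding element zero or more times
  • ? makes the * 'lazy', so the match stops at the first "|" instead of the last
  • \\| matches a literal "|"; since "|" is a regex metacharacter (alternation), it must be escaped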
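
To see what the lazy quantifier changes, here is a small sketch on a made-up string with two pipes (the input is hypothetical; the question's data has only one pipe per value, so both patterns behave the same there). It passes perl = TRUE, which is my assumption for predictable lazy matching and is not part of the original call:

```r
# Hypothetical input with two pipes, to make the difference visible.
x <- "A | B | C"

# Lazy: ".*?" stops at the first "|".
lazy <- sub(".*?\\|", "", x, perl = TRUE)

# Greedy: ".*" runs on to the last "|".
greedy <- sub(".*\\|", "", x, perl = TRUE)

print(lazy)    # " B | C"
print(greedy)  # " C"
```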

Test:

df <- data.frame(col1 = c("INV | Building One", 
                          "BO | Building Twenty Five",
                          "VC | Corporate"))

sub(".*?\\|", "", df$col1)
#> [1] " Building One"         " Building Twenty Five" " Corporate"
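
Note that the result above still has the space that followed the "|", while the expected output in the question does not. A minimal sketch of two ways to drop it; the negated character class in the first option is a pattern I swapped in, not the one from this answer, and `x` is just a stand-in for the column:

```r
x <- c("INV | Building One",
       "BO | Building Twenty Five",
       "VC | Corporate")

# Option 1: match everything up to the first "|" plus any whitespace after it.
# "[^|]*" means "any run of characters that are not a pipe".
res1 <- sub("^[^|]*\\|\\s*", "", x)

# Option 2: keep the pattern from this answer and trim the result.
res2 <- trimws(sub(".*?\\|", "", x))

print(res1)
print(res2)
```

Both give the three building names without the leading space.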

Here is a brilliant regex cheatsheet I use for this kind of stuff: https://rstudio.com/wp-content/uploads/2016/09/RegExCheatsheet.pdf

BTW: tidyr comes with a nice little function that would help here:

library(tidyr)
df %>% 
  separate(col1, into = c("col1", "col2"), sep = "\\|")
#>   col1                  col2
#> 1 INV           Building One
#> 2  BO   Building Twenty Five
#> 3  VC              Corporate

It splits your one column into two, which seems like a sensible structure for this data.
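
If you would rather not load tidyr, the same split can be sketched in base R. This is an illustration only: the output column names `code` and `building` are hypothetical, and the separator pattern also swallows the spaces around the "|" so neither piece needs trimming:

```r
df <- data.frame(col1 = c("INV | Building One",
                          "BO | Building Twenty Five",
                          "VC | Corporate"),
                 stringsAsFactors = FALSE)

# Split each value on the pipe and any surrounding whitespace.
parts <- strsplit(df$col1, "\\s*\\|\\s*")

# Every value here has exactly two pieces, so the list binds into a 3 x 2 matrix.
mat <- do.call(rbind, parts)

# `code` and `building` are assumed names for the two pieces.
out <- data.frame(code = mat[, 1],
                  building = mat[, 2],
                  stringsAsFactors = FALSE)
print(out)
```

This assumes every value contains exactly one "|"; rows with a different number of pieces would need extra handling before the rbind step.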

Upvotes: 5
