Reputation: 155

Truncating the end of a string in R after a character that can be present zero or more times

I have the following data:

temp<-c("AIR BAGS:FRONTAL" ,"SERVICE BRAKES HYDRAULIC:ANTILOCK",
    "PARKING BRAKE:CONVENTIONAL",
    "SEATS:FRONT ASSEMBLY:POWER ADJUST",
    "POWER TRAIN:AUTOMATIC TRANSMISSION",
    "SUSPENSION",
    "ENGINE AND ENGINE COOLING:ENGINE",
    "SERVICE BRAKES HYDRAULIC:ANTILOCK",
    "SUSPENSION:FRONT",
    "ENGINE AND ENGINE COOLING:ENGINE",
    "VISIBILITY:WINDSHIELD WIPER/WASHER:LINKAGES")

I would like to create a new vector that retains only the text before the first ":" in the cases where a ":" is present, and the whole word when ":" is not present.

I have tried to use:

temp=data.frame(matrix(unlist(str_split(temp,pattern=":",n=2)), 
+                        ncol=2, byrow=TRUE))

but it does not work in the cases where there is no ":"

I know this question is very similar to: truncate string from a certain character in R, which used:

sub("^[^.]*", "", x)

But I am not very familiar with regular expressions and have struggled to reverse that example to retain only the beginning of the string.

Upvotes: 8

Answers (5)

Andrie

Reputation: 179488

You can solve this with a simple regex:

sub("(.*?):.*", "\\1", x)
 [1] "AIR BAGS"                  "SERVICE BRAKES HYDRAULIC"  "PARKING BRAKE"             "SEATS"                    
 [5] "POWER TRAIN"               "SUSPENSION"                "ENGINE AND ENGINE COOLING" "SERVICE BRAKES HYDRAULIC" 
 [9] "SUSPENSION"                "ENGINE AND ENGINE COOLING" "VISIBILITY"

How the regex works:

"(.*?):.*" Look for a repeated set of any characters .* but modify it with ? to not be greedy. This should be followed by a colon and then any character (repeated)
Substitute the entire string with the bit found inside the parentheses - "\\1"

The bit to understand is that any regex match is greedy by default. By modifying it to be non-greedy, the first pattern match can not include the colon, since the first character after the parentheses is a colon. The regex after the colon is back to the default, i.e. greedy.

Upvotes: 16

shhhhimhuntingrabbits

Reputation: 7475

sorry to add this as an answer. In response to times taken:

> yy<-rep("foo1:bar1",times=100000)
> system.time(yy1<-sapply(strsplit(yy,":"),'[',1))
   user  system elapsed 
   0.26    0.00    0.27 
> 
> system.time(yy2<-sub("(.*?):.*", "\\1", yy))
   user  system elapsed 
    0.1     0.0     0.1 
> 
> system.time(yy3 <- sub(":.*$", "", yy ))
   user  system elapsed 
   0.08    0.00    0.07 
> 
> system.time(yy4<-gsub("([^:]*).*","\\1",yy))
   user  system elapsed 
   0.09    0.00    0.09

The regex are roughly equivalent the strsplit takes a bit longer

Upvotes: 3

Greg Snow

Reputation: 49650

Another approach is to look for the first ":" and replace it and anything after it with nothing:

yy <- sub(":.*$", "", yy )

If no ":" is found then nothing is substituted and you get the whole of the original string. If there is a ":" then the first one is matched along with everything after it, this is then replace with nothing ("") which deletes it and leaves everything up to that first colon.

Upvotes: 9

shhhhimhuntingrabbits

Reputation: 7475

in this case

yy<-c("AIR BAGS:FRONTAL",
"SERVICE BRAKES HYDRAULIC:ANTILOCK",
"PARKING BRAKE:CONVENTIONAL",
"SEATS:FRONT ASSEMBLY:POWER ADJUST",
"POWER TRAIN:AUTOMATIC TRANSMISSION",
"SUSPENSION",
"ENGINE AND ENGINE COOLING:ENGINE",
"SERVICE BRAKES HYDRAULIC:ANTILOCK",
"SUSPENSION:FRONT",
"ENGINE AND ENGINE COOLING:ENGINE",
"VISIBILITY:WINDSHIELD WIPER/WASHER:LINKAGES")
yy<-gsub("([^:]*).*","\\1",yy)
yy

may work for you

Upvotes: 1

joran

Reputation: 173667

Does this work (assuming your data is in a character vector):

x <- c('foobar','foo:bar','foo1:bar1 foo:bar','foo bar')
> sapply(str_split(x,":"),'[',1)
[1] "foobar"  "foo"     "foo1"    "foo bar"

Upvotes: 3

Truncating the end of a string in R after a character that can be present zero or more times

Answers (5)

Related Questions