Reputation: 155
I have the following data:
temp<-c("AIR BAGS:FRONTAL" ,"SERVICE BRAKES HYDRAULIC:ANTILOCK",
"PARKING BRAKE:CONVENTIONAL",
"SEATS:FRONT ASSEMBLY:POWER ADJUST",
"POWER TRAIN:AUTOMATIC TRANSMISSION",
"SUSPENSION",
"ENGINE AND ENGINE COOLING:ENGINE",
"SERVICE BRAKES HYDRAULIC:ANTILOCK",
"SUSPENSION:FRONT",
"ENGINE AND ENGINE COOLING:ENGINE",
"VISIBILITY:WINDSHIELD WIPER/WASHER:LINKAGES")
I would like to create a new vector that retains only the text before the first ":" in the cases where a ":" is present, and the whole word when ":" is not present.
I have tried to use:
temp=data.frame(matrix(unlist(str_split(temp,pattern=":",n=2)),
ncol=2, byrow=TRUE))
but it does not work in the cases where there is no ":"
I know this question is very similar to: truncate string from a certain character in R, which used:
sub("^[^.]*", "", x)
But I am not very familiar with regular expressions and have struggled to reverse that example to retain only the beginning of the string.
Upvotes: 8
Views: 14606
Reputation: 179488
You can solve this with a simple regex:
sub("(.*?):.*", "\\1", x)
[1] "AIR BAGS" "SERVICE BRAKES HYDRAULIC" "PARKING BRAKE" "SEATS"
[5] "POWER TRAIN" "SUSPENSION" "ENGINE AND ENGINE COOLING" "SERVICE BRAKES HYDRAULIC"
[9] "SUSPENSION" "ENGINE AND ENGINE COOLING" "VISIBILITY"
How the regex works:
"(.*?):.*"
Look for a repeated set of any characters (.*), but modify it with ? so that it is not greedy. This should be followed by a colon and then any character (repeated).
"\\1"
Substitute the entire match with the bit found inside the parentheses.
The bit to understand is that regex quantifiers are greedy by default. Made non-greedy, the first group matches as little as possible, so it stops at the first colon and cannot include it, since the first character after the parentheses is a colon. The regex after the colon is back to the default, i.e. greedy.
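To see the difference the ? makes, here is a minimal sketch that runs the same substitution with a lazy and a greedy group. The two-element vector x and the names lazy and greedy are made up for illustration; the strings are taken from the question's data.

```r
# Illustrative input: one string with two colons, one with none.
x <- c("SEATS:FRONT ASSEMBLY:POWER ADJUST", "SUSPENSION")

# Lazy group (.*?): captures as little as possible, so it ends at the first colon.
lazy <- sub("(.*?):.*", "\\1", x)

# Greedy group (.*): captures as much as possible, so it ends at the last colon.
greedy <- sub("(.*):.*", "\\1", x)

print(lazy)    # "SEATS" "SUSPENSION"
print(greedy)  # "SEATS:FRONT ASSEMBLY" "SUSPENSION"
```

In both cases "SUSPENSION" comes back unchanged, because the pattern requires a colon and sub() returns the input as-is when nothing matches.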
Upvotes: 16
Reputation: 7475
Sorry to add this as an answer. In response to the question of timings:
> yy<-rep("foo1:bar1",times=100000)
> system.time(yy1<-sapply(strsplit(yy,":"),'[',1))
user system elapsed
0.26 0.00 0.27
>
> system.time(yy2<-sub("(.*?):.*", "\\1", yy))
user system elapsed
0.1 0.0 0.1
>
> system.time(yy3 <- sub(":.*$", "", yy ))
user system elapsed
0.08 0.00 0.07
>
> system.time(yy4<-gsub("([^:]*).*","\\1",yy))
user system elapsed
0.09 0.00 0.09
The regex approaches are roughly equivalent; strsplit takes a bit longer.
Upvotes: 3
Reputation: 49650
Another approach is to look for the first ":" and replace it and anything after it with nothing:
yy <- sub(":.*$", "", yy )
If no ":" is found, then nothing is substituted and you get the whole of the original string. If there is a ":", then the first one is matched along with everything after it. This match is then replaced with nothing (""), which deletes it and leaves everything up to that first colon.
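A quick check of both cases, as a small sketch; the two-element vector and the name res are chosen here for illustration, using strings from the question:

```r
# One string with several colons, one with none.
yy <- c("VISIBILITY:WINDSHIELD WIPER/WASHER:LINKAGES", "SUSPENSION")

# Delete from the first colon to the end of the string.
res <- sub(":.*$", "", yy)

print(res)  # "VISIBILITY" "SUSPENSION"
```

The match starts at the leftmost colon and .* then runs to the end of the string, so later colons need no special handling.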
Upvotes: 9
Reputation: 7475
In this case
yy<-c("AIR BAGS:FRONTAL",
"SERVICE BRAKES HYDRAULIC:ANTILOCK",
"PARKING BRAKE:CONVENTIONAL",
"SEATS:FRONT ASSEMBLY:POWER ADJUST",
"POWER TRAIN:AUTOMATIC TRANSMISSION",
"SUSPENSION",
"ENGINE AND ENGINE COOLING:ENGINE",
"SERVICE BRAKES HYDRAULIC:ANTILOCK",
"SUSPENSION:FRONT",
"ENGINE AND ENGINE COOLING:ENGINE",
"VISIBILITY:WINDSHIELD WIPER/WASHER:LINKAGES")
yy<-gsub("([^:]*).*","\\1",yy)
yy
may work for you.
Upvotes: 1
Reputation: 173667
Does this work (assuming your data is in a character vector)? It uses str_split from the stringr package:
> x <- c('foobar','foo:bar','foo1:bar1 foo:bar','foo bar')
> sapply(str_split(x,":"),'[',1)
[1] "foobar" "foo" "foo1" "foo bar"
Upvotes: 3