Reputation: 549
I have a text file with over 10,000 lines. Each line contains a word that starts with CDID_ followed by more characters with no spaces (11 in the examples below):
a <- c("Test CDID_1254WE_1023 Sky","CDID_1254XE01478 Blue","This File named as CDID_ZXASWE_1111")
I would like to extract only the words that start with CDID_, so that the lines above become:
CDID_1254WE_1023
CDID_1254XE01478
CDID_ZXASWE_1111
Upvotes: 1
Views: 2530
Reputation: 109874
I'd use a lookbehind with the stringi package:
a <- c("Test CDID_1254WE_1023 Sky","CDID_1254XE01478 Blue","This File named as CDID_ZXASWE_1111")
library(stringi)
stringi::stri_extract_all_regex(a, '(?<=(^|\\s))(CDID_[^ ]+)')
(?<=(^|\\s)) = preceded by the beginning of the string or whitespace; then CDID_; and then [^ ]+ = one or more following characters that are not spaces.
[[1]]
[1] "CDID_1254WE_1023"
[[2]]
[1] "CDID_1254XE01478"
[[3]]
[1] "CDID_ZXASWE_1111"
You may want to use unlist to force it into a vector.
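If exactly one identifier per line is expected, a possible shortcut is stri_extract_first_regex, which returns a character vector directly (NA where nothing matches), so no unlist step is needed. A minimal sketch, assuming the same pattern and the sample vector from the question:

```r
library(stringi)

a <- c("Test CDID_1254WE_1023 Sky",
       "CDID_1254XE01478 Blue",
       "This File named as CDID_ZXASWE_1111")

# First match per element; the result has the same length as the input
ids <- stri_extract_first_regex(a, "(?<=(^|\\s))CDID_[^ ]+")
ids
```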
Upvotes: 1
Reputation: 99341
Here are three base R options.
Option 1: Use sub(), removing everything except the CDID_* section:
sub(".*(CDID_\\S+).*", "\\1", a)
# [1] "CDID_1254WE_1023" "CDID_1254XE01478" "CDID_ZXASWE_1111"
Option 2: Use regexpr(), extracting the CDID_* section:
regmatches(a, regexpr("CDID_\\S+", a))
# [1] "CDID_1254WE_1023" "CDID_1254XE01478" "CDID_ZXASWE_1111"
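One thing to watch with a large file: regmatches() with regexpr() returns only the matches, so lines without a CDID_ word are dropped and the result can be shorter than the input (sub(), by contrast, returns such lines unchanged). A minimal sketch of keeping the result aligned with the input, assuming a made-up second line that contains no identifier:

```r
# The second element is a hypothetical line with no CDID_ word
a <- c("Test CDID_1254WE_1023 Sky",
       "no id on this line",
       "This File named as CDID_ZXASWE_1111")

m <- regexpr("CDID_\\S+", a)          # -1 marks elements without a match
out <- rep(NA_character_, length(a))  # start with NA for every line
out[m != -1] <- regmatches(a, m)      # fill in only the lines that matched
out
```

Here out has one entry per input line, with NA in the position of the line that has no identifier.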
Option 3: For a data frame result, we can use strcapture() (added in R 3.4.0) and do all the work in a single call:
strcapture(".*(CDID_\\S+).*", a, data.frame(out = character()))
#                out
# 1 CDID_1254WE_1023
# 2 CDID_1254XE01478
# 3 CDID_ZXASWE_1111
Upvotes: 7
Reputation: 39154
The other solutions are great. Here is another one using functions from the stringr package. We first split each string on spaces with str_split, convert the resulting list to a vector with unlist, and then use str_subset to keep the strings that begin with CDID_.
library(stringr)
str_subset(unlist(str_split(a, pattern = " ")), "^CDID_")
[1] "CDID_1254WE_1023" "CDID_1254XE01478" "CDID_ZXASWE_1111"
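The split-and-subset approach returns a flat vector, so it no longer lines up with the input if some line has no CDID_ word (or has more than one). If one result per line is needed, a possible alternative within stringr is str_extract, which returns the first match for each element and NA where there is none. A minimal sketch, where the second line is made up for illustration:

```r
library(stringr)

# The second element is a hypothetical line with no CDID_ word
a <- c("Test CDID_1254WE_1023 Sky",
       "no id here",
       "This File named as CDID_ZXASWE_1111")

# One result per input element; NA where the pattern does not match
ids <- str_extract(a, "CDID_\\S+")
ids
```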
Upvotes: 1