Curious

Reputation: 549

Extract specific words from a text file?

I have a text file with over 10,000 lines. Each line has a word that starts with CDID_ followed by 11 more characters with no spaces, as below:

a <- c("Test CDID_1254WE_1023 Sky","CDID_1254XE01478 Blue","This File named as CDID_ZXASWE_1111")

I would like to extract only the words that start with CDID_, so that the lines above become:

CDID_1254WE_1023
CDID_1254XE01478
CDID_ZXASWE_1111

Upvotes: 1

Views: 2530

Answers (3)

Tyler Rinker

Reputation: 109874

I'd use a lookbehind with the stringi package:

a <- c("Test CDID_1254WE_1023 Sky","CDID_1254XE01478 Blue","This File named as CDID_ZXASWE_1111")

library(stringi)

stringi::stri_extract_all_regex(a, '(?<=(^|\\s))(CDID_[^ ]+)')

(?<=(^|\\s)) = preceded by the beginning of the string or a whitespace character; CDID_ = the literal prefix; [^ ]+ = one or more following characters that are not spaces.

[[1]]
[1] "CDID_1254WE_1023"

[[2]]
[1] "CDID_1254XE01478"

[[3]]
[1] "CDID_ZXASWE_1111"

You may want to use unlist to force it into a vector.
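
As a minimal sketch of that last step, the list result can be flattened with unlist(). The stri_extract_first_regex() variant shown next to it is not part of the answer above; it is included on the assumption that each line holds at most one identifier, in which case it returns a character vector directly. The variable names are illustrative only.

```r
library(stringi)

a <- c("Test CDID_1254WE_1023 Sky", "CDID_1254XE01478 Blue",
       "This File named as CDID_ZXASWE_1111")

pattern <- "(?<=(^|\\s))(CDID_[^ ]+)"

# stri_extract_all_regex() returns a list with one element per input string
matches <- stri_extract_all_regex(a, pattern)

# unlist() flattens that list into a plain character vector
ids_all <- unlist(matches)

# Assumption: at most one identifier per line. In that case the "first"
# variant gives a character vector without the intermediate list.
ids_first <- stri_extract_first_regex(a, pattern)
```

With the three sample strings, both ids_all and ids_first hold the three CDID_ words in input order.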

Upvotes: 1

Rich Scriven

Reputation: 99341

Here are three base R options.

Option 1: Use sub(), removing everything except the CDID_* section:

sub(".*(CDID_\\S+).*", "\\1", a)
# [1] "CDID_1254WE_1023" "CDID_1254XE01478" "CDID_ZXASWE_1111"

Option 2: Use regexpr(), extracting the CDID_* section:

regmatches(a, regexpr("CDID_\\S+", a))
# [1] "CDID_1254WE_1023" "CDID_1254XE01478" "CDID_ZXASWE_1111"

Option 3: For a data frame result, we can use the strcapture() function (available since R 3.4.0) and do all the work in a single call:

strcapture(".*(CDID_\\S+).*", a, data.frame(out = character()))
#                out
# 1 CDID_1254WE_1023
# 2 CDID_1254XE01478
# 3 CDID_ZXASWE_1111
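
Options 1 and 2 behave differently when a line contains no CDID_ word, which may matter in a 10,000-line file. The sketch below uses a made-up vector b with one non-matching line to show the difference. The grepl() filter at the end is one possible way to handle it, not something the answer prescribes.

```r
# Hypothetical input: the second element has no CDID_ word
b <- c("Test CDID_1254WE_1023 Sky", "no identifier on this line")

# sub() returns a non-matching string unchanged
from_sub <- sub(".*(CDID_\\S+).*", "\\1", b)

# regmatches() with regexpr() silently drops non-matching elements
from_regmatches <- regmatches(b, regexpr("CDID_\\S+", b))

# One way to get a clean result from sub(): keep only matching lines first
has_id <- grepl("CDID_\\S+", b)
ids <- sub(".*(CDID_\\S+).*", "\\1", b[has_id])
```

Here from_sub still has two elements (the second is the untouched input line), while from_regmatches and ids each contain only the identifier.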

Upvotes: 7

www

Reputation: 39154

All the other solutions are great. Here is one solution using functions from the stringr package. We can first split each string on spaces with str_split, convert the resulting list to a vector with unlist, and then use str_subset to keep the strings that begin with CDID_.

library(stringr)

str_subset(unlist(str_split(a, pattern = " ")), "^CDID_")
[1] "CDID_1254WE_1023" "CDID_1254XE01478" "CDID_ZXASWE_1111"
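
Because the split-and-subset approach flattens everything into one vector, the result no longer lines up with the input lines. If that correspondence matters, a possible alternative (not part of the answer above) is str_extract(), which returns one value per input string and NA where nothing matches. The pattern CDID_\\S+ is borrowed from the base R answer.

```r
library(stringr)

a <- c("Test CDID_1254WE_1023 Sky", "CDID_1254XE01478 Blue",
       "This File named as CDID_ZXASWE_1111")

# One result per input element: the first CDID_ word in each string
ids <- str_extract(a, "CDID_\\S+")

# A string without an identifier yields NA rather than being dropped
with_gap <- str_extract(c(a[1], "nothing here"), "CDID_\\S+")
```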

Upvotes: 1
