How to delete all data after the third colon in strings/substrings in R?

Question

So I have a series of about 200,000 data points that look like this: DATA:abc:de123fg:12ghk8d and DATA:ghi:56kdv:128485hg. The only identifying data that I need to look at is before the third colon. I want to remove everything after the third colon so I can aggregate unique identifiers from the rest of the substring.

So far, I have attempted to use str_remove_all and gsub to remove everything after the third colon. The problem with this is that sometimes the data points are grouped together in the same string like this:

DATA:ghi:56kdv:128485hg|DATA:abc:de123fg:12ghk8d

So str_remove_all just removes the end of the last substring, and the result ends up looking like this:

DATA:ghi:56kdv:128485hg|DATA:abc:de123fg

Does anyone know how I can accomplish this task? Thanks in advance.

MSR · Accepted Answer

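Here's an option in base R with regmatches and regexpr. The pattern [^:]*:[^:]* matches the first two colon-separated fields, and regexpr returns only the first match in each string: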

str <- c("DATA:abc:de123fg:12ghk8d", "DATA:ghi:56kdv:128485hg|DATA:abc:de123fg:12ghk8d")

regmatches(str, regexpr("[^:]*:[^:]*", str)) 
#> [1] "DATA:abc" "DATA:ghi"

And the corresponding solution in stringr, if you prefer:

library(stringr)

str_extract(str, "[^:]*:[^:]*")
#> [1] "DATA:abc" "DATA:ghi"

Created on 2019-12-03 by the reprex package (v0.3.0)
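
If the goal is instead to keep everything before the third colon in every "|"-separated entry, as the question describes, a base R gsub with a capture group is one possible approach. This is only a sketch under stated assumptions: entries are separated by "|", the fields themselves never contain ":" or "|", and the names trimmed and ids are hypothetical, chosen for illustration.

```r
str <- c("DATA:abc:de123fg:12ghk8d",
         "DATA:ghi:56kdv:128485hg|DATA:abc:de123fg:12ghk8d")

# Capture the first three colon-separated fields of each entry, then match
# the third colon and everything up to (not including) the next "|".
# Replacing the whole match with the captured group drops the tail.
# Assumption: "|" separates entries and never appears inside a field.
trimmed <- gsub("([^:|]*:[^:|]*:[^:|]*):[^|]*", "\\1", str)
trimmed
#> [1] "DATA:abc:de123fg"                "DATA:ghi:56kdv|DATA:abc:de123fg"

# Split the grouped entries apart and keep each identifier once.
ids <- unique(unlist(strsplit(trimmed, "|", fixed = TRUE)))
ids
#> [1] "DATA:abc:de123fg" "DATA:ghi:56kdv"
```

Excluding "|" from each character class keeps a match from running across entry boundaries, which is what lets gsub trim every entry rather than only the last one. Entries with fewer than three colons are left unchanged. The same pattern and replacement should also work with stringr's str_replace_all.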
