Reputation: 284
So I have a series of about 200,000 data points that look like this: DATA:abc:de123fg:12ghk8d and DATA:ghi:56kdv:128485hg. The only identifying data that I need to look at is before the third colon. I want to remove everything after the third colon so I can aggregate unique identifiers from the rest of the substring.
So far, I have attempted to use str_remove_all and gsub to remove everything after the third colon. The problem with this is that sometimes the data points are grouped together in the same string like this:
DATA:ghi:56kdv:128485hg|DATA:abc:de123fg:12ghk8d
So str_remove_all only removes the end of the last substring, and the result looks like this:
DATA:ghi:56kdv:128485hg|DATA:abc:de123fg
Does anyone know how I can accomplish this task? Thanks in advance.
Upvotes: 1
Views: 201
Reputation: 2881
Here's an option in base R with regmatches and regexpr:
str <- c("DATA:abc:de123fg:12ghk8d", "DATA:ghi:56kdv:128485hg|DATA:abc:de123fg:12ghk8d")
regmatches(str, regexpr("[^:]*:[^:]*", str))
#> [1] "DATA:abc" "DATA:ghi"
And the corresponding solution in stringr, if you prefer:
library(stringr)
str_extract(str, "[^:]*:[^:]*")
#> [1] "DATA:abc" "DATA:ghi"
Created on 2019-12-03 by the reprex package (v0.3.0)
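Note that the pattern above keeps the text before the second colon and returns only the first entry of each string. If the goal is the text before the third colon for every pipe-separated entry, one possible sketch in base R is below. The variable names x, entries, ids and trimmed are just illustrative, and it assumes the pipe character only ever appears as a separator between entries:

x <- c("DATA:abc:de123fg:12ghk8d",
       "DATA:ghi:56kdv:128485hg|DATA:abc:de123fg:12ghk8d")

# Split on the literal pipe so every entry becomes its own element.
entries <- unlist(strsplit(x, "|", fixed = TRUE))

# Keep the first three colon-separated fields of each entry.
# Entries with fewer than two colons do not match and are dropped.
ids <- regmatches(entries, regexpr("^[^:]*:[^:]*:[^:]*", entries))

# Unique identifiers, ready for aggregation.
unique(ids)
#> [1] "DATA:abc:de123fg" "DATA:ghi:56kdv"

# Alternative: trim each entry in place and keep the grouping.
trimmed <- gsub("(([^:|]*:){2}[^:|]*):[^|]*", "\\1", x, perl = TRUE)
trimmed
#> [1] "DATA:abc:de123fg"                "DATA:ghi:56kdv|DATA:abc:de123fg"

Splitting first is probably the simpler route when you only need the set of identifiers; the gsub version may be handier if the grouped strings have to stay in their original rows.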
Upvotes: 2