Reputation: 373
I have these columns:
   text.NANA              text.22 text.32
1  Female RNDM_MXN95.tif  No      NA
12 Male RNDM_QOS38.tif    No      NA
13 Female RNDM_WQW90.tif  No      NA
14 Male RNDM_BKD94.tif    No      NA
15 Male RNDM_LGD67.tif    No      NA
16 Female RNDM_AFP45.tif  No      NA
I want to create a column that contains only the barcode, which starts with RNDM_ and ends with .tif, but without the .tif itself. The tricky part is getting rid of the gender information that is in the same column. There is a varying number of spaces between the gender information and the RNDM_ prefix. This is the result I am after:
   text.NANA              text.22 text.32 BARCODE
1  Female RNDM_MXN95.tif  No      NA      RNDM_MXN95
12 Male RNDM_QOS38.tif    No      NA      RNDM_QOS38
13 Female RNDM_WQW90.tif  No      NA      RNDM_WQW90
14 Male RNDM_BKD94.tif    No      NA      RNDM_BKD94
15 Male RNDM_LGD67.tif    No      NA      RNDM_LGD67
16 Female RNDM_AFP45.tif  No      NA      RNDM_AFP45
I made a very poor attempt with this, but it didn't work:
dfrm$BARCODE <- regexpr("RNDM_", dfrm$text.NANA)
# [1] 8 6 9 7 7 8 9 9 8 8 9 9 6 6 7 8 9 8
# attr(,"match.length")
# [1] 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5
# attr(,"useBytes")
# [1] TRUE
Please help. Thanks!
Upvotes: 0
Views: 1103
Reputation: 545598
So you just want to remove the file extension? Use file_path_sans_ext from the {tools} package:
dfrm$BARCODE = tools::file_path_sans_ext(dfrm$text.NANA)
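A quick check on made-up values (the vector x is hypothetical, modeled on the question's text.NANA column) suggests that this removes only the extension and leaves any leading text in place:

```r
# Hypothetical sample values modeled on the question's text.NANA column
x <- c("Female RNDM_MXN95.tif", "Male RNDM_QOS38.tif")

# Strips the trailing ".tif" only; the leading gender text is untouched
no_ext <- tools::file_path_sans_ext(x)
no_ext
# [1] "Female RNDM_MXN95" "Male RNDM_QOS38"
```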
If there’s more stuff in front, you can use the following regular expression to extract just the barcode, capturing everything from RNDM_ up to the .tif extension:
dfrm$BARCODE = stringr::str_match(dfrm$text.NANA, '(RNDM_.*)\\.tif')[, 2]
Note that I’m using the {stringr} package here because the base R functions for extracting regex matches are terrible. Nobody uses them.
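For comparison, here is a rough sketch of what the base R route might look like. It assumes hypothetical sample values in x and pairs regexpr() with regmatches(); the lookahead keeps .tif out of the match:

```r
# Hypothetical sample values with a varying number of spaces before RNDM_
x <- c("Female RNDM_MXN95.tif", "Male   RNDM_QOS38.tif")

# perl = TRUE enables the (?=...) lookahead, so ".tif" is required but not captured.
# Caveat: regmatches() silently drops elements that do not match, so the result
# can be shorter than x if some rows lack a barcode.
barcode <- regmatches(x, regexpr("RNDM_.*(?=\\.tif)", x, perl = TRUE))
barcode
# [1] "RNDM_MXN95" "RNDM_QOS38"
```

Because of that caveat, assigning the result straight into a data frame column is only safe when every row is known to contain a barcode.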
I strongly recommend against using strsplit here because it’s underspecified: from reading the code it’s absolutely not clear what the purpose of that code is. Write code that is self-explanatory, not code that requires explanation in a comment.
Upvotes: 2
Reputation: 302
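You can do this easily with sapply() and strsplit():
sapply(strsplit(dfrm$text.NANA, "_"), "[", 1)
Edit: the line above splits on the underscore and therefore keeps the gender text ("Female RNDM"). Splitting on runs of spaces or dots and taking the second piece returns the barcode instead:
dfrm$BARCODE <- sapply(strsplit(dfrm$text.NANA, "[ .]+"), "[", 2)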
Upvotes: 0