Reputation: 13354
I have a dataframe (df) with a column (Col2) like this:
Col1 Col2 Col3
1 C607989_booboobear_Nation A
2 C607989_booboobear_Nation B
3 C607989_booboobear_Nation C
4 C607989_booboobear_Nation D
5 C607989_booboobear_Nation E
6 C607989_booboobear_Nation F
I want to extract just the number in Col2:
Col1 Col2 Col3
1 607989 A
2 607989 B
3 607989 C
4 607989 D
5 607989 E
6 607989 F
I have tried things like:
gsub("^.*?_","_",df$Col2)
but it's not working.
Upvotes: 17
Views: 64250
Reputation: 887881
Or, you could use a regex lookbehind:
library(stringr)
# current stringr versions use ICU regex, which supports lookbehind directly, so the old perl() wrapper is not needed
str_extract(dat$Col2, "(?<=[A-Z])\\d+")
#[1] "607989" "607989" "607989" "607989" "607989" "607989"
(?<=[A-Z]) is a lookbehind: it matches only if the substring is immediately preceded by a capital letter. The lookbehind pattern has a fixed length; in this case it is 1.
\\d+ is the pattern to be extracted: one or more digits.
In these strings, this occurs only at the C607989 in C607989_booboobear_Nation, so only the digits that follow the capital letter are extracted.
Suppose you have a string like this:
v1 <- c(dat$Col2, "booboobear_D600078_Nation")
str_extract(v1, "(?<=[A-Z])\\d+")
#[1] "607989" "607989" "607989" "607989" "607989" "607989" "600078"
It still gets the number.
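For readers without stringr, the same lookbehind can be sketched in base R. This is only an illustration, not part of the original answer: the vector is rebuilt by hand from the question's values, and regexpr() with perl = TRUE stands in for str_extract().

```r
# Col2 values from the question, plus the extra string used above
v1 <- c(rep("C607989_booboobear_Nation", 6), "booboobear_D600078_Nation")

# perl = TRUE switches regexpr() to PCRE, which supports lookbehind
m <- regexpr("(?<=[A-Z])[0-9]+", v1, perl = TRUE)

# regmatches() returns the matched text for every element that matched
digits <- regmatches(v1, m)
digits
#[1] "607989" "607989" "607989" "607989" "607989" "607989" "600078"
```

Note that regmatches() silently drops elements with no match, so the result can be shorter than the input. That matters if you assign it back to a data frame column.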
Upvotes: 3
Reputation: 110054
An alternate approach uses qdap::genXtract, which grabs strings between a left and a right boundary. Here I use "C" and "_" for the left and right bounds:
## Your data in a better form for sharing
dat <- structure(list(Col1 = c("1", "2", "3", "4", "5", "6"), Col2 = c("C607989_booboobear_Nation",
"C607989_booboobear_Nation", "C607989_booboobear_Nation", "C607989_booboobear_Nation",
"C607989_booboobear_Nation", "C607989_booboobear_Nation"), Col3 = c("A",
"B", "C", "D", "E", "F")), .Names = c("Col1", "Col2", "Col3"), row.names = c(NA,
-6L), class = "data.frame")
library(qdap)
dat[[2]] <- unlist(genXtract(dat[[2]], "C", "_"))
dat
## Col1 Col2 Col3
## 1 1 607989 A
## 2 2 607989 B
## 3 3 607989 C
## 4 4 607989 D
## 5 5 607989 E
## 6 6 607989 F
Upvotes: 3
Reputation: 193687
If your string is not too fancy/complex, it might be easiest to do something like:
gsub("C([0-9]+)_.*", "\\1", df$Col2)
# [1] "607989" "607989" "607989" "607989" "607989" "607989"
The pattern matches a "C", followed by digits, followed by an underscore and then anything else. The digits are captured with (), and the replacement is set to that capture group (\\1).
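One thing to watch with this approach: when the pattern does not match, sub() and gsub() return the string unchanged rather than a missing value. Here is a minimal sketch of one way to guard against that, assuming you want NA for non-matching rows and a numeric result. The second input value is hypothetical, added only to show the non-matching case.

```r
# first value is from the question; the second is a made-up non-matching string
col2 <- c("C607989_booboobear_Nation", "no_digits_here")

# anchored version of the pattern from the answer
pattern <- "^C([0-9]+)_.*$"

# flag which strings actually match before substituting
matched <- grepl(pattern, col2)

# keep the captured digits for matches, NA otherwise
num <- ifelse(matched, sub(pattern, "\\1", col2), NA_character_)

# convert to numeric if a number rather than a string is wanted
as.numeric(num)
```

Without the grepl() check, the second element would come back as "no_digits_here", and as.numeric() would turn it into NA with a coercion warning.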
Upvotes: 14