Reputation: 13354

Remove part of a string in dataframe column (R)

I have a dataframe (df) with a column (Col2) like this:

Col1                 Col2                   Col3
  1   C607989_booboobear_Nation               A
  2   C607989_booboobear_Nation               B
  3   C607989_booboobear_Nation               C
  4   C607989_booboobear_Nation               D
  5   C607989_booboobear_Nation               E
  6   C607989_booboobear_Nation               F

I want to extract just the number in Col2

Col1              Col2                    Col3
  1              607989                     A
  2              607989                     B
  3              607989                     C
  4              607989                     D
  5              607989                     E
  6              607989                     F

I have tried things like:

gsub("^.*?_","_",df$Col2)

but it's not working.

Upvotes: 17

Answers (3)

akrun

Reputation: 887881

Or, you could use regex lookbehind

library(stringr)
 str_extract(dat$Col2, perl('(?<=[A-Z])\\d+'))
 #[1] "607989" "607989" "607989" "607989" "607989" "607989"

(?<=[A-Z]) Matches if the searched substring is preceded by a match for a capital letter of fixed length. In this case it is 1.

\\d+ the pattern/substring to be extracted are digits.

In the strings, this occurs only at C607989_booboobear_Nation. So, it extracts only the digits that follows that pattern

Suppose you have a string like this:

 v1 <- c(dat$Col2, "booboobear_D600078_Nation")
 str_extract(v1, perl('(?<=[A-Z])\\d+'))
 #[1] "607989" "607989" "607989" "607989" "607989" "607989" "600078"

still gets the number

Upvotes: 3

Tyler Rinker

Reputation: 110054

An alternate approach using qdap::genXtract that grabs strings between a left and right boundary. Here I use C and _ for the left and right bounds:

## Your data in a better form for sharing
dat <- structure(list(Col1 = c("1", "2", "3", "4", "5", "6"), Col2 = c("C607989_booboobear_Nation", 
    "C607989_booboobear_Nation", "C607989_booboobear_Nation", "C607989_booboobear_Nation", 
    "C607989_booboobear_Nation", "C607989_booboobear_Nation"), Col3 = c("A", 
    "B", "C", "D", "E", "F")), .Names = c("Col1", "Col2", "Col3"), row.names = c(NA, 
    -6L), class = "data.frame")

library(qdap)
dat[[2]] <- unlist(genXtract(dat[[2]], "C", "_"))
dat

##   Col1   Col2 Col3
## 1    1 607989    A
## 2    2 607989    B
## 3    3 607989    C
## 4    4 607989    D
## 5    5 607989    E
## 6    6 607989    F

Upvotes: 3

A5C1D2H2I1M1N2O1R2T1

Reputation: 193687

If your string is not too fancy/complex, it might be easiest to do something like:

gsub("C([0-9]+)_.*", "\\1", df$Col2)
# [1] "607989" "607989" "607989" "607989" "607989" "607989"

Start with a "C", followed by digits, followed by an underscore and then anything else. Digits are captured with (), and the replacement is set to that capture group (\\1).

Upvotes: 14

Remove part of a string in dataframe column (R)

Answers (3)

Related Questions