anikaM

Reputation: 429

How to extract number from character string?

I have a dataframe like this:

    > dns1
               variant_id         gene_id pval_nominal
21821  chr1_165656237_T_C_b38 ENSG00000143149  1.24119e-05
21822 chr1_165659346_C_CA_b38 ENSG00000143149  1.24119e-05
21823  chr1_165659350_A_G_b38 ENSG00000143149  1.24119e-05
21824  chr1_165659415_A_G_b38 ENSG00000143149  1.24119e-05
21825  chr1_165660430_T_C_b38 ENSG00000143149  1.24119e-05
21826  chr1_165661135_T_G_b38 ENSG00000143149  1.24119e-05
21827  chr1_165661238_C_T_b38 ENSG00000143149  1.24119e-05
...

I would like to strip everything from the variant_id column except the second number, so that it looks like this:

165656237
165659346
165659350
165659415
165660430
165661135
165661238
...

I tried this:

dns1$variant_id <- gsub('[^0-9.]', '', dns1$variant_id)

but with the above command I am getting this:

> dns1
      variant_id         gene_id pval_nominal
21821    116565623738 ENSG00000143149  1.24119e-05
21822    116565934638 ENSG00000143149  1.24119e-05
21823    116565935038 ENSG00000143149  1.24119e-05
21824    116565941538 ENSG00000143149  1.24119e-05
...

So this keeps every digit in the variant_id column, and I need to get 165656237 instead of 116565623738. How can I match only the second number in this column?

Upvotes: 6

Views: 14220

Answers (6)

zx8754

Reputation: 56249

Using utils::strcapture we can extract all parts of the variant ID, including the genomic position.

# example input
x <- c("chr1_165656237_T_C_b38", "chr1_165659346_C_CA_b38")

# get pattern for each part
pattern <- "(.*?)_([[:digit:]]+)_([A-Z]+)_([A-Z]+)_(b[0-9]+)"

# empty dataframe with columns to match after split
proto <- data.frame(chrom = character(), position = integer(), 
                    allele1 = character(), allele2 = character(), build = character())

# extract
strcapture(pattern, x, proto)
#   chrom  position allele1 allele2 build
# 1  chr1 165656237       T       C   b38
# 2  chr1 165659346       C      CA   b38

Upvotes: 0

Wiktor Stribiżew

Reputation: 627468

You may use

dns1$variant_id <- sub('^[^_]*_(\\d+).*', '\\1', dns1$variant_id)


Details

  • ^ - start of string
  • [^_]* - 0+ chars other than _
  • _ - an underscore
  • (\\d+) - Group 1: one or more digits
  • .* - the rest of the string.

The sub function will only perform a single search and replace operation on each string, and the \1 backreference in the replacement will put back the contents in Group 1.

Online R demo:

variant_id <- c("chr1_165656237_T_C_b38", "chr1_165659346_C_CA_b38")
dns1 <- data.frame(variant_id)
dns1$variant_id <- sub('^[^_]*_(\\d+).*', '\\1', dns1$variant_id)
dns1
##   variant_id
## 1  165656237
## 2  165659346

Upvotes: 9

akash87

Reputation: 3994

You can use

dns1$variant_id_new <- sapply(strsplit(as.character(dns1$variant_id), "_"), unlist)[2,]

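Logically, this first splits every string in variant_id on the _. The sapply(..., unlist) call then simplifies the result into a matrix with one column per string, and we take the second row (the second field). This relies on every ID splitting into the same number of pieces.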
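As a rough sketch of the same split-and-pick idea, the version below takes the second piece of each split directly, so it does not need the matrix step. The sample vector and the names parts and position are made up for illustration.

```r
# Hypothetical sample data mirroring the question's variant_id column
variant_id <- c("chr1_165656237_T_C_b38", "chr1_165659346_C_CA_b38")

# Split each ID on a literal "_"; the result is a list with one character vector per ID
parts <- strsplit(variant_id, "_", fixed = TRUE)

# Take the second piece of each vector; vapply() checks that each result is a single string
position <- vapply(parts, function(p) p[2], character(1))

# Convert to numeric if the positions are needed as numbers rather than text
position <- as.numeric(position)
position
```

An ID with fewer than two pieces would give NA here instead of an error, which may or may not be what you want.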

Upvotes: 1

g_t_m

Reputation: 714

Here's an option using stringr:

library(stringr)

df <-
  data.frame(variant_id = c("chr1_165656237_T_C_b38",
                            "chr1_165659346_C_CA_b38",
                            "chr1_165659350_A_G_b38",
                            "chr1_165659415_A_G_b38",
                            "chr1_165660430_T_C_b38",
                            "chr1_165661135_T_G_b38",
                            "chr1_165661238_C_T_b38"))

df$variant_id_extract <-
  str_replace(df$variant_id, "^.+_(\\d+)_.+$", "\\1")

df
#>                variant_id variant_id_extract
#> 1  chr1_165656237_T_C_b38          165656237
#> 2 chr1_165659346_C_CA_b38          165659346
#> 3  chr1_165659350_A_G_b38          165659350
#> 4  chr1_165659415_A_G_b38          165659415
#> 5  chr1_165660430_T_C_b38          165660430
#> 6  chr1_165661135_T_G_b38          165661135
#> 7  chr1_165661238_C_T_b38          165661238

Upvotes: 2

Russ Hyde

Reputation: 2269

I believe you can catch the digits as follows:

gsub(".*?_([[:digit:]]+)_.*", "\\1", dns1$variant_id)
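A small check of that pattern, as a sketch only: the sample vector is made up, and perl = TRUE is my own addition so that the lazy .*? is handled by the PCRE engine. One thing to watch is that sub() and gsub() return a string unchanged when the pattern does not match, so the sketch marks those cases as NA.

```r
# Hypothetical sample data; the last element deliberately lacks the expected format
ids <- c("chr1_165656237_T_C_b38", "chr1_165659346_C_CA_b38", "no-position-here")

pattern <- ".*?_([[:digit:]]+)_.*"

# Replace each whole string with the digits captured in group 1;
# strings that do not match the pattern come back unchanged
extracted <- sub(pattern, "\\1", ids, perl = TRUE)

# Mark non-matching inputs as NA instead of silently keeping the original text
extracted[!grepl(pattern, ids, perl = TRUE)] <- NA
extracted
```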

Upvotes: 5

Joseph Clark McIntyre

Reputation: 1094

Here is a super hacky solution which uses both gsub and str_replace (from stringr). I'm sure there are better solutions, and this requires that variant_id always begins with chr1_, which may not be a fair assumption.

dns1$variant_id <- gsub('_(.*)', '', stringr::str_replace(dns1$variant_id, 'chr1_', ''))
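If the chr1_ assumption is a concern, the same two-step idea can probably be written in base R without hard-coding the prefix. This is only a sketch; the sample IDs (including the chrX one) and the names no_prefix and position are invented for illustration.

```r
# Hypothetical IDs on different chromosomes, to show the prefix is not hard-coded
ids <- c("chr1_165656237_T_C_b38", "chrX_1234567_G_GA_b38")

# Step 1: drop everything up to and including the first underscore
no_prefix <- sub("^[^_]*_", "", ids)

# Step 2: drop everything from the next underscore onward
position <- sub("_.*", "", no_prefix)
position
```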

Upvotes: 1
