Reputation: 2571
Is there a way to split a string like this?
A1BG\tAAAGGGCGTTCACCGG\t2 A1BG\tAAGATAGCATCCCACT\t1
I would like to split by "\" in order to count how many genes are in the file (a gene in this case is A1BG) and how many codes there are (for example, AAAGGGCGTTCACCGG and AAGATAGCATCCCACT). My attempt below hasn't been successful.
strsplit(mydf, '\')[[1]]
Can anyone help me please?
Upvotes: 2
Views: 788
Reputation: 66819
It looks like you have a malformed TSV (tab-separated values) table. If you swap the spaces for newlines, you can read it in as a table and don't need to set up your own parsing rules:
x <- "A1BG\tAAAGGGCGTTCACCGG\t2 A1BG\tAAGATAGCATCCCACT\t1"
x2 <- gsub(" ", "\n", x)
library(data.table)
DT = setnames(fread(x2), c("gene", "code", "num"))[]
# gene code num
# 1: A1BG AAAGGGCGTTCACCGG 2
# 2: A1BG AAGATAGCATCCCACT 1
Then you can count how many codes there are per gene like
DT[, .N, by=gene]
# or
DT[, .(N = uniqueN(code)), by=gene]
# gene N
# 1: A1BG 2
or similarly use dplyr's count() and n_distinct() functions.
Upvotes: 4
Reputation: 886938
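We can use str_count from stringr to count the 16-letter codes:
str1 <- "A1BG\tAAAGGGCGTTCACCGG\t2 A1BG\tAAGATAGCATCCCACT\t1"
library(stringr)
str_count(str1, "[ACGT]{16}")
#[1] 2
If we are splitting, then split at the tab (\t). Inside an R string, "\t" is a single tab character, not a backslash followed by t, so there is no "\" to split on:
strsplit(str1, "\t")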
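Splitting on the tab alone leaves the space between the two records in place (one piece comes out as "2 A1BG"). A minimal base R sketch that splits on both tabs and spaces and then counts genes and codes, assuming every record has exactly three fields (gene, code, number); the names fields, recs, n_genes and n_codes are only illustrative:

```r
str1 <- "A1BG\tAAAGGGCGTTCACCGG\t2 A1BG\tAAGATAGCATCCCACT\t1"

# Split on any run of tabs or spaces so both separators are handled
fields <- strsplit(str1, "[ \t]+")[[1]]

# Assumption: three fields per record, so fill a 3-column matrix row by row
recs <- matrix(fields, ncol = 3, byrow = TRUE,
               dimnames = list(NULL, c("gene", "code", "num")))

n_genes <- length(unique(recs[, "gene"]))  # number of distinct genes
n_codes <- length(unique(recs[, "code"]))  # number of distinct codes
```

If a record can have missing fields, the matrix reshape would misalign the columns, so reading the data as a table is the safer route in that case.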
Upvotes: 2
Reputation: 520878
We can try matching on the regex pattern \b[ACGT]{16}\b, and then counting the number of matches in the input string:
x <- "A1BG\tAAAGGGCGTTCACCGG\t2 A1BG\tAAGATAGCATCCCACT\t1"
matches <- regmatches(x, gregexpr("\\b[ACGT]{16}\\b", x, perl=TRUE))[[1]]
length(matches)
[1] 2
If a code might not be exactly 16 bases long, replace the fixed length with a range that still matches only the codes, e.g. [ACGT]{10,20} for anything between 10 and 20 bases.
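The variable-length case can be sketched with a range quantifier in place of the fixed {16}. The bounds 10 and 20 are only an assumption taken from the example range above and should be adjusted to the real data; n_codes and n_unique are illustrative names:

```r
x <- "A1BG\tAAAGGGCGTTCACCGG\t2 A1BG\tAAGATAGCATCCCACT\t1"

# {10,20} accepts runs of 10 to 20 bases; \\b keeps matches on whole tokens
matches <- regmatches(x, gregexpr("\\b[ACGT]{10,20}\\b", x, perl = TRUE))[[1]]

n_codes  <- length(matches)          # all codes found
n_unique <- length(unique(matches))  # distinct codes only
```

The gene name A1BG is not picked up because it contains characters outside ACGT and has no run of bases long enough to satisfy the lower bound.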
Upvotes: 4