NewUsr_stat

Reputation: 2571

Strsplit and count occurrences

Is there a way to split a string like this?

A1BG\tAAAGGGCGTTCACCGG\t2 A1BG\tAAGATAGCATCCCACT\t1

I would like to split by "\" in order to count how many genes are in the file (a gene here is A1BG) and how many codes there are (codes are, for example, AAAGGGCGTTCACCGG and AAGATAGCATCCCACT). My attempt below hasn't been successful.

strsplit(mydf, '\')[[1]]

Can anyone help me please?

Upvotes: 2

Views: 788

Answers (3)

Frank

Reputation: 66819

It looks like you have a malformed TSV (tab-separated values) table. If you swap the spaces for newlines, you can read it in as a table and don't need to set up your own parsing rules:

x <- "A1BG\tAAAGGGCGTTCACCGG\t2 A1BG\tAAGATAGCATCCCACT\t1"
x2 <- gsub(" ", "\n", x)

library(data.table)
DT = setnames(fread(x2), c("gene", "code", "num"))[]

#    gene             code num
# 1: A1BG AAAGGGCGTTCACCGG   2
# 2: A1BG AAGATAGCATCCCACT   1

Then you can count how many codes there are per gene like

DT[, .N, by=gene]
# or 
DT[, .(N = uniqueN(code)), by=gene]

#    gene N
# 1: A1BG 2

or similarly use dplyr's count and n_distinct functions.
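As a rough sketch of the dplyr route mentioned above (assuming dplyr is installed; the data frame `df` and the column name `n_codes` are made up here for illustration and simply mirror the table shown earlier):

```r
library(dplyr)

# Same two rows as the table above, built by hand so the example is self-contained
df <- data.frame(
  gene = c("A1BG", "A1BG"),
  code = c("AAAGGGCGTTCACCGG", "AAGATAGCATCCCACT"),
  num  = c(2L, 1L)
)

# Number of rows per gene (the count lands in a column named n)
rows_per_gene <- count(df, gene)

# Number of distinct codes per gene
codes_per_gene <- df %>%
  group_by(gene) %>%
  summarise(n_codes = n_distinct(code))
```

The two results differ only when the same code appears more than once for a gene.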

Upvotes: 4

akrun

Reputation: 886938

We can use str_count from stringr to count the 16-base codes directly:

library(stringr)
str_count(str1, "[ACGT]{16}")
#[1] 2

If we are splitting, then split at the tab character (\t). The string contains tabs, not literal backslashes, so splitting on "\" cannot work:

strsplit(str1, "\t")
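Going one step further, here is a minimal base-R sketch that turns the split into counts. It assumes records are separated by a single space and fields by tabs, exactly as in the sample string; the names `fields`, `tab`, `n_genes` and `n_codes` are only illustrative.

```r
str1 <- "A1BG\tAAAGGGCGTTCACCGG\t2 A1BG\tAAGATAGCATCCCACT\t1"

# Split on either a space (record separator) or a tab (field separator)
fields <- strsplit(str1, "[ \t]")[[1]]

# Assumption: every record has exactly three fields (gene, code, number)
tab <- matrix(fields, ncol = 3, byrow = TRUE,
              dimnames = list(NULL, c("gene", "code", "num")))

n_genes <- length(unique(tab[, "gene"]))  # distinct genes
n_codes <- length(unique(tab[, "code"]))  # distinct codes
```

If a record could have a missing field, the matrix step would misalign, so reading the data as a table is safer in that case.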

data

str1 <- "A1BG\tAAAGGGCGTTCACCGG\t2 A1BG\tAAGATAGCATCCCACT\t1"

Upvotes: 2

Tim Biegeleisen

Reputation: 520878

We can try matching on the regex pattern \b[ACGT]{16}\b, and then counting the number of matches in the input string:

x <- "A1BG\tAAAGGGCGTTCACCGG\t2 A1BG\tAAGATAGCATCCCACT\t1"
matches <- regmatches(x, gregexpr("\\b[ACGT]{16}\\b", x, perl=TRUE))[[1]]
length(matches)

[1] 2

If the codes might not be exactly 16 bases long, replace the fixed length with a range that still matches every code but nothing else in the string (e.g. between 10 and 20 bases, written {10,20}).
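For instance, a sketch of the same match with a length range instead of a fixed length (the 10 to 20 range is just the example figure from above, not something known about the real data):

```r
x <- "A1BG\tAAAGGGCGTTCACCGG\t2 A1BG\tAAGATAGCATCCCACT\t1"

# {10,20} accepts any run of 10 to 20 bases bounded by word boundaries
matches <- regmatches(x, gregexpr("\\b[ACGT]{10,20}\\b", x, perl = TRUE))[[1]]
n_matches <- length(matches)
```

The lower bound matters: it must stay high enough that short pieces of gene names made only of A, C, G and T are not counted as codes.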

Upvotes: 4
