Filter grouped row by highest occurrence of string with dplyr

Question

I am working on collapsing a transcriptomics dataset from transcript level to gene level for a downstream analysis. Within this dataset, each row is a transcript (qry_transcript_id) with a gene identifier (qry_gene_id), and each qry_gene_id can have multiple qry_transcript_ids. I would like to filter the dataset to select, for each qry_gene_id, the qry_transcript_id that has the greatest number of GO ids (GO:XXXXXXX). The go_id column holds a list of GO ids separated by ",".

Here is a subset of my data:

test_data <- structure(list(qry_transcript_id = c("TU22", "TU20", "TU27", 
"TU29", "TU25", "TU26", "TU28", "TU31", "TU24", "TU30"), go_id = c(NA, 
NA, "GO:0004672,GO:0005515,GO:0005524,GO:0006468", "GO:0005003,GO:0005515,GO:0005524,GO:0005887,GO:0006468,GO:0007169", 
"GO:0005003,GO:0005515,GO:0005524,GO:0005887,GO:0006468,GO:0007169", 
"GO:0005003,GO:0005515,GO:0005524,GO:0005887,GO:0006468,GO:0007169", 
"GO:0005003,GO:0005515,GO:0005524,GO:0005887,GO:0006468,GO:0007169", 
"GO:0005003,GO:0005515,GO:0005524,GO:0005887,GO:0006468,GO:0007169", 
"GO:0005003,GO:0005515,GO:0005524,GO:0005887,GO:0006468,GO:0007169", 
"GO:0005003,GO:0005515,GO:0005524,GO:0005887,GO:0006468,GO:0007169"
), ref_gene_id = c("LOC108906571", "LOC108906589", "LOC108906588", 
"LOC108906588", "LOC108906588", "LOC108906588", "LOC108906588", 
"LOC108906588", "LOC108906588", "LOC108906588"), qry_gene_id = c("G10", 
"G9", "G12", "G12", "G12", "G12", "G12", "G12", "G12", "G12"), 
    ref_gene_name = c("uncharacterized LOC108906571", "uncharacterized LOC108906589", 
    "ephrin type-B receptor 1-B", "ephrin type-B receptor 1-B", 
    "ephrin type-B receptor 1-B", "ephrin type-B receptor 1-B", 
    "ephrin type-B receptor 1-B", "ephrin type-B receptor 1-B", 
    "ephrin type-B receptor 1-B", "ephrin type-B receptor 1-B"
    ), gene_annotation = c("refseq", "refseq", "refseq", "refseq", 
    "refseq", "refseq", "refseq", "refseq", "refseq", "refseq"
    ), ref_transcript_id = c("XM_018709871.1", "XM_018709894.2", 
    "XM_018709891.1", "XM_018709891.1", "XM_018709891.1", "XM_018709891.1", 
    "XM_018709891.1", "XM_018709891.1", "XM_018709891.1", "XM_018709891.1"
    ), ref_transcript_name = c("uncharacterized LOC108906571", 
    "uncharacterized LOC108906589", "ephrin type-B receptor 1-B, transcript variant X2", 
    "ephrin type-B receptor 1-B, transcript variant X2", "ephrin type-B receptor 1-B, transcript variant X2", 
    "ephrin type-B receptor 1-B, transcript variant X2", "ephrin type-B receptor 1-B, transcript variant X2", 
    "ephrin type-B receptor 1-B, transcript variant X2", "ephrin type-B receptor 1-B, transcript variant X2", 
    "ephrin type-B receptor 1-B, transcript variant X2"), class_code = c("i", 
    "k", "j", "j", "=", "j", "j", "j", "j", "j")), row.names = 21:30, class = "data.frame")

As you can see, for qry_gene_id = G12 the first transcript (TU27) is missing a couple of GO ids. I want to make sure that my filter selects a transcript that has the full complement of GO ids.

However, I'm stuck on how to filter this appropriately. Here's where I'm at.

test_data <- test_data %>% group_by(qry_gene_id) %>% filter()

It seems logical to me to filter by either 1) the total length of the go_id string (which I think should capture the longest list of GO terms) or 2) the number of occurrences of a string (e.g. "GO"), keeping the rows with the highest count. Basically, I want to end up not leaving out any of the GO terms associated with each gene.

Gregor Thomas · Accepted Answer

Here's an approach to keep the rows with the highest count of "GO" in each group:

library(dplyr)
library(stringr)
test_data %>% 
  mutate(go_count = str_count(go_id, "GO")) %>%
  group_by(qry_gene_id) %>% 
  slice_max(go_count)

See ?slice_max in case you want to fine tune this, e.g., adjust what happens when there are ties. The default will keep all rows tied for the most occurrences of "GO" within a group.
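If exactly one transcript per gene is needed, the tie handling mentioned above can be sketched as follows. This is a minimal sketch under a few assumptions: the toy data frame is illustrative only (it is not the asker's data), the pattern "GO:" is used instead of "GO", and missing go_id values are counted as zero so that genes without any GO annotation are kept regardless of how the installed dplyr version treats NA in slice_max.

```r
library(dplyr)
library(stringr)

# Toy data mimicking the question's columns (values are illustrative only)
toy <- data.frame(
  qry_transcript_id = c("TU22", "TU27", "TU29", "TU25"),
  qry_gene_id       = c("G10", "G12", "G12", "G12"),
  go_id             = c(NA,
                        "GO:0004672,GO:0005515",
                        "GO:0005003,GO:0005515,GO:0005524",
                        "GO:0005003,GO:0005515,GO:0005524")
)

one_per_gene <- toy %>%
  # str_count() returns NA for a missing go_id, so map NA to 0
  mutate(go_count = coalesce(str_count(go_id, "GO:"), 0L)) %>%
  group_by(qry_gene_id) %>%
  # with_ties = FALSE keeps a single row per group even when counts are tied
  slice_max(go_count, n = 1, with_ties = FALSE) %>%
  ungroup()

print(one_per_gene)
```

With with_ties = FALSE, which of the tied transcripts is kept is essentially arbitrary, so add a second criterion (for example, class_code) if the choice matters.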

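You could also use something like slice(which.max(nchar(go_id))), which keeps the first row with the maximum number of characters in each group (filter() needs a logical condition, so which.max() belongs in slice()). Note that a group whose go_id values are all NA returns no rows with this approach.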
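The character-length idea can also be written with a logical condition inside filter(). The sketch below is one possible variant, not the only way to do it: the toy data frame and the helper column go_chars are made up for illustration, and missing go_id values are treated as length 0 so that unannotated genes are retained. String length works as a proxy here because every GO id has the same width.

```r
library(dplyr)

# Toy data mimicking the question's columns (values are illustrative only)
toy <- data.frame(
  qry_transcript_id = c("TU22", "TU27", "TU29", "TU25"),
  qry_gene_id       = c("G10", "G12", "G12", "G12"),
  go_id             = c(NA,
                        "GO:0004672,GO:0005515",
                        "GO:0005003,GO:0005515,GO:0005524",
                        "GO:0005003,GO:0005515,GO:0005524")
)

longest <- toy %>%
  # nchar() returns NA for a missing string; treat those as length 0
  mutate(go_chars = coalesce(nchar(go_id), 0L)) %>%
  group_by(qry_gene_id) %>%
  # Keep every row whose string length equals the group maximum (ties kept)
  filter(go_chars == max(go_chars)) %>%
  ungroup()

print(longest)
```

Unlike which.max(), this comparison keeps all tied rows, matching the default behaviour of slice_max.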
