Alureon

Reputation: 179

Subsetting a Data Frame Based on Character Ending

I'm a newbie to R. I've a data.frame. The beginning and end, respectively, look like:

[Images: beginning of data frame; end of data frame]

What I'd like to do is to subset this data frame based on the last digit(s) of the "barcode" column. The digits go from 1 to 16, so there are 16 groups. I'd like to group these 16 into 5 groups. For example, all barcodes ending in "1" and "2" would be one subset of the data frame, all barcodes ending in "3", "4" and "5" would go into another subset, and so on.

I tried this using the which() and endsWith() functions:

my_frame = data.frame()
character_one = as.character(1)
subset_by_group_one <- my_frame[which(endsWith(my_frame, character_one)),]

However, I get the following error:

Error in endsWith(barcode_subset, character_one) : non-character object(s)

It seems that based on the R documentation, the endsWith() function must take in a character, and not a data frame. Yet I'd like to use it -- or something like it -- on my data frame to subset it. What's the best way to do this? Is there a way to coerce a data frame to a character? Will I need to use a loop to iterate through the data frame?

Upvotes: 1

Views: 4311

Answers (4)

moodymudskipper

Reputation: 47300

Actually, the function you were looking for is base::endsWith. It returns a logical vector and takes the character vector to test as its first argument, so pass it the column rather than the whole data frame.

df2 <- df1[endsWith(df1$z, "2"), ]
#     z whatev
# 1  v2   blah
# 3  s2   blah
# 8  j2   blah
# 9  n2   blah
# 10 z2   blah

dplyr::ends_with was essentially made to be used inside dplyr calls, especially select, though we can make it work here as well by being careful about argument order. It returns integer positions rather than a logical vector, though for row subsetting in this case it doesn't make a difference.

library(dplyr)
df2 <- df1[ends_with("2", vars = df1$z), ]
#     z whatev
# 1  v2   blah
# 3  s2   blah
# 8  j2   blah
# 9  n2   blah
# 10 z2   blah

data

set.seed(1)
df1 <- data.frame(z = paste0(sample(letters, 10), sample(1:3, 10, replace = TRUE)),
                  whatev = "blah", stringsAsFactors = FALSE)
#     z whatev
# 1  v2   blah
# 2  q3   blah
# 3  s2   blah
# 4  m1   blah
# 5  l1   blah
# 6  y1   blah
# 7  a1   blah
# 8  j2   blah
# 9  n2   blah
# 10 z2   blah

Ironically, base::endsWith is much better suited to dplyr::filter calls than dplyr::ends_with, because filter expects one logical value per row.
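As a minimal sketch of that last point, here is endsWith used inside filter. The toy data frame and the name ends_in_2 are made up for illustration and are not the df1 generated above.

```r
library(dplyr)

# Hypothetical toy data, only for illustration
toy <- data.frame(z = c("v2", "q3", "s2", "m1"), whatev = "blah",
                  stringsAsFactors = FALSE)

# endsWith() gives one TRUE/FALSE per row, which is what filter() expects
ends_in_2 <- toy %>% filter(endsWith(z, "2"))
ends_in_2
```

Here filter keeps the rows whose z value ends in "2" (v2 and s2 in the toy data).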

Upvotes: 2

IRTFM

Reputation: 263301

I'm using Dan Hall's example. This builds a "splitting/grouping vector" by removing everything up to and including the dash, converting the remainder to numeric, and then grouping it with findInterval. The grouping you wanted was somewhat unclear, but you can modify the second argument to findInterval to adjust it:

grp <- findInterval(as.numeric(gsub("^.+[-]", "", my_frame$barcode)),
                    c(.5, 2.5, 5.5, 8.5, 12.5, 16.5)) # split boundaries
split(my_frame, grp)
$`1`
                            barcode other
TCGCGCGTTACATGT-1 TCGCGCGTTACATGT-1  blah
GCGTGTTATCCGCCT-2 GCGTGTTATCCGCCT-2  blah
CTCCCTCTTCTGTGC-1 CTCCCTCTTCTGTGC-1  blah
TTCTTGTGCGACAAA-2 TTCTTGTGCGACAAA-2  blah

$`2`
                            barcode other
CTTACGTCGTCAGCA-3 CTTACGTCGTCAGCA-3  blah
CCCATGTGTGACTAC-4 CCCATGTGTGACTAC-4  blah
GAGCCCAGAACTGTG-5 GAGCCCAGAACTGTG-5  blah
GTTGGCGAGCAGCAT-3 GTTGGCGAGCAGCAT-3  blah
ATTTAGGGGACCCAA-4 ATTTAGGGGACCCAA-4  blah
TGGCCAATGCGTTGA-5 TGGCCAATGCGTTGA-5  blah

$`3`
                            barcode other
TCCGTCCGGGGAGGA-6 TCCGTCCGGGGAGGA-6  blah
TTCAAATCGTCTACT-7 TTCAAATCGTCTACT-7  blah
AGGTACAATCTCGCA-8 AGGTACAATCTCGCA-8  blah
CGTGACTCCAATGGT-6 CGTGACTCCAATGGT-6  blah
CCGGGGGGTTGCCCC-7 CCGGGGGGTTGCCCC-7  blah
CTTTAAGTGTGTCAG-8 CTTTAAGTGTGTCAG-8  blah

$`4`
                              barcode other
TGCTGACAGTTAGAG-9   TGCTGACAGTTAGAG-9  blah
GGAAGGTGCAGAGGC-10 GGAAGGTGCAGAGGC-10  blah
AATTTAGGGCGGCCT-11 AATTTAGGGCGGCCT-11  blah
CCATCATGCGGGACG-12 CCATCATGCGGGACG-12  blah
TCCGAATCTGAGCAA-9   TCCGAATCTGAGCAA-9  blah
TCCCACCCTTTCTCG-10 TCCCACCCTTTCTCG-10  blah
CTCCTGGTCGCCACA-11 CTCCTGGTCGCCACA-11  blah
TCCCGCAACATGTAC-12 TCCCGCAACATGTAC-12  blah

$`5`
                              barcode other
TAAGAGTGCCAGTCC-13 TAAGAGTGCCAGTCC-13  blah
ACTCCACTGCCCAAC-14 ACTCCACTGCCCAAC-14  blah
CACCGTGGGTGCACA-15 CACCGTGGGTGCACA-15  blah
TGGGTGTCTGTCATG-16 TGGGTGTCTGTCATG-16  blah
CTGACATTGGTACAC-13 CTGACATTGGTACAC-13  blah
GCGCAGGTTCGAACC-14 GCGCAGGTTCGAACC-14  blah
TTTTTTCCCCCGACC-15 TTTTTTCCCCCGACC-15  blah
CCCAGCTGCCATTGA-16 CCCAGCTGCCATTGA-16  blah
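To see how the boundaries map suffixes to groups, it may help to run findInterval on the bare numbers 1 to 16. This is only a sketch; the second set of boundaries (alt_bounds) is a hypothetical alternative showing how the second argument could be adjusted, not something taken from the question.

```r
suffix <- 1:16

# Boundaries used above: groups 1-2, 3-5, 6-8, 9-12, 13-16
bounds <- c(.5, 2.5, 5.5, 8.5, 12.5, 16.5)
grp <- findInterval(suffix, bounds)

# Hypothetical alternative: groups 1-2, 3-5, 6-9, 10-14, 15-16
alt_bounds <- c(.5, 2.5, 5.5, 9.5, 14.5, 16.5)
alt_grp <- findInterval(suffix, alt_bounds)

# Side-by-side view of suffix and the group it falls into
data.frame(suffix, grp, alt_grp)
```

Each suffix is assigned the index of the interval it falls into, so moving a boundary moves the cut between two neighbouring groups.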

Upvotes: 4

De Novo

Reputation: 7600

You want to subset the rows according to a regular expression pattern: "ends with 16" is written "16$". I think the most direct way to do this is with a logical vector that is TRUE for the rows that end in 16. Produce the logical vector with grepl(pattern, x), where x is the column with the values you're interested in. Then use that logical index vector in the row position of the subset expression, my_frame[<index vector>, ]. The data simulation is shown below. drop is set to FALSE in case the barcodes are actually row names rather than a column and the data frame has only one other column; without it, a one-column result would be simplified to a vector.

my_frame[grepl("16$", my_frame$barcode), , drop = FALSE]
#                               barcode other
# GACCTAAATGCCTGT-16 GACCTAAATGCCTGT-16  blah
# GAAATTGACATGACT-16 GAAATTGACATGACT-16  blah
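For the single-digit groups in the question, a bare "1$" would also match barcodes ending in 11, so the pattern probably needs to be anchored on the dash. A small sketch, assuming every barcode has the form <sequence>-<number>; the toy data and the names group_one and loose are illustrative only.

```r
# Hypothetical toy barcodes, one per suffix 1..16
toy <- data.frame(barcode = paste0("ACGT-", 1:16), other = "blah",
                  stringsAsFactors = FALSE)

# Anchored on the dash: matches only -1 and -2
group_one <- toy[grepl("-(1|2)$", toy$barcode), , drop = FALSE]

# Without the dash the pattern also picks up -11 and -12
loose <- toy[grepl("(1|2)$", toy$barcode), , drop = FALSE]
```

The same alternation idea extends to the other groups, for example "-(3|4|5)$".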

Data:

barcode <- replicate(32, {
  paste(sample(c("T", "A", "C", "G"), 15, replace = TRUE), collapse = "")
})
barcode <- paste0(barcode, "-", 1:16)
my_frame <- data.frame(row.names = barcode, barcode = barcode, other = rep("blah", 32), stringsAsFactors = FALSE)
head(my_frame)
#                             barcode other
# GTCCGGTGATGATAA-1 GTCCGGTGATGATAA-1  blah
# CTGCTACATATAGAA-2 CTGCTACATATAGAA-2  blah
# GTGACCGTGGTCGAA-3 GTGACCGTGGTCGAA-3  blah
# TCTAGGACGATTACT-4 TCTAGGACGATTACT-4  blah
# GAGGGAGGCGTCCAT-5 GAGGGAGGCGTCCAT-5  blah
# CAGCAGCCTCCACCG-6 CAGCAGCCTCCACCG-6  blah

Upvotes: 4

neilfws

Reputation: 33772

I'd use a regular expression to extract the endings, then join with a data frame containing the group information.

Some example data:

library(tidyverse)
df1 <- data.frame(x = paste0("AAA-", 1:16))

Some example groups: 1-2 = 1; 3-5 = 2; 6-9 = 3; 10-14 = 4; 15-16 = 5.

Join with df1:

df1 %>% 
  mutate(suffix = str_match(x, "-(\\d+)$")[, 2] %>% as.numeric()) %>%
  # name the join column explicitly to avoid the "Joining, by" message
  left_join(data.frame(suffix = 1:16, 
                       group = c(1,1,2,2,2,3,3,3,3,4,4,4,4,4,5,5)),
            by = "suffix")

        x suffix group
1   AAA-1      1     1
2   AAA-2      2     1
3   AAA-3      3     2
4   AAA-4      4     2
5   AAA-5      5     2
6   AAA-6      6     3
7   AAA-7      7     3
8   AAA-8      8     3
9   AAA-9      9     3
10 AAA-10     10     4
11 AAA-11     11     4
12 AAA-12     12     4
13 AAA-13     13     4
14 AAA-14     14     4
15 AAA-15     15     5
16 AAA-16     16     5
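If the goal is five separate data frames rather than one with a group column, the same lookup idea can be sketched in base R and finished with split. The names lookup and subsets are illustrative, and the sketch assumes the suffix is always an integer from 1 to 16.

```r
df1 <- data.frame(x = paste0("AAA-", 1:16), stringsAsFactors = FALSE)

# Group for each suffix 1..16, same example grouping as above
lookup <- c(1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 4, 5, 5)

# Strip everything up to the last dash and use the suffix as an index
df1$suffix <- as.integer(sub("^.*-", "", df1$x))
df1$group <- lookup[df1$suffix]

# One data frame per group, returned as a named list
subsets <- split(df1, df1$group)
```

subsets[["1"]] then holds the rows for suffixes 1 and 2, and so on for the other groups.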

Upvotes: 1
