user1774225

Reputation: 71

Remove rows of a data set belonging to a factor of specified length

I have a data.frame similar to the following:

df <- data.frame(population = c("AA","AA","AA","BB","BB","CC","CC","CC"),
                 individual = c("A1","A2","A3","B1","B2","C1","C2","C3"),
                 Haplotype1 = rep(1:4,2),
                 Haplotype2 = rep(5:8,2))
> df
  population individual Haplotype1 Haplotype2
1         AA         A1          1          5
2         AA         A2          2          6
3         AA         A3          3          7
4         BB         B1          4          8
5         BB         B2          1          5
6         CC         C1          2          6
7         CC         C2          3          7
8         CC         C3          4          8

I want to create a new dataset where any population consisting of fewer than a specified number of individuals is omitted. For example, I want to reanalyze the data for only populations with three or more individuals. The following is the dataset I want:

> df <- df[!df$population=="BB",]
> df
  population individual Haplotype1 Haplotype2
1         AA         A1          1          5
2         AA         A2          2          6
3         AA         A3          3          7
6         CC         C1          2          6
7         CC         C2          3          7
8         CC         C3          4          8

However, I have 400 populations ranging in size from 5 to 155 individuals, and manually picking populations out by name is not feasible. I want to write a function where I say, in essence, "give me a dataset with all populations consisting of X individuals or more, and delete those with fewer than X." Any help or feedback is appreciated.

Upvotes: 5

Views: 938

Answers (3)

Tyler Rinker

Reputation: 110054

This would work as well:

lens <- tapply(df$population , df$population, length)
df[df$population %in% names(lens)[lens > 2], ]

EDIT: Per mrdwab's sharp reading I have edited my answer. I must admit I looked at the input and output only:

lens <- tapply(df$individual, df$population, function(x) length(unique(x)))
df[df$population %in% names(lens)[lens > 2], ]

Upvotes: 3

A5C1D2H2I1M1N2O1R2T1

Reputation: 193687

The most direct approach I can think of is to use data.table() from the "data.table" package:

library(data.table)
DT <- data.table(population = c("AA","AA","AA","BB","BB","CC","CC","CC"),
                 individual = c("A1","A2","A3","B1","B2","C1","C2","C3"),
                 Haplotype1 = rep(1:4,2), Haplotype2 = rep(5:8,2),
                 key = "population")
## Or, convert your existing data.frame "df" to data.table:
## DT <- data.table(df, key = "population")
DT[, .SD[length(unique(individual)) >= 3], by = key(DT)]
#    population individual Haplotype1 Haplotype2
# 1:         AA         A1          1          5
# 2:         AA         A2          2          6
# 3:         AA         A3          3          7
# 4:         CC         C1          2          6
# 5:         CC         C2          3          7
# 6:         CC         C3          4          8

Update

I'm not sure if this is important to you or not, but note that with Tyler's and Sven's current solutions, although the output is correct according to the data in the question you've posted, there is actually some potentially flawed thinking going on.

I write "potentially" because you mention that you're looking for groups (from df$population) where there are three or more individuals (from df$individual). However, both of their solutions currently only look at the lengths of population, while by your actual question I would have assumed that you would want the number of unique individuals mentioned by population.

Here's a simple example. Using your original "df", change the individual in row 3 to "A2" (df[3, 2] <- "A2"). Now, according to your criteria in your question, only rows with population == "CC" should be returned.

If your data already only has unique individuals, then no problem--but I thought I would mention it ;)
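To make the difference concrete, here is a minimal sketch that rebuilds the question's data with the duplicated individual in row 3 and compares the two ways of counting. The names df_dup, row_counts, and unique_counts are only illustrative and do not appear in the question.

```r
# Same data as in the question, but row 3 repeats individual "A2".
df_dup <- data.frame(
  population = c("AA", "AA", "AA", "BB", "BB", "CC", "CC", "CC"),
  individual = c("A1", "A2", "A2", "B1", "B2", "C1", "C2", "C3"),
  Haplotype1 = rep(1:4, 2),
  Haplotype2 = rep(5:8, 2),
  stringsAsFactors = FALSE
)

# Counting rows per population treats the repeated "A2" as a third individual.
row_counts <- table(df_dup$population)
names(row_counts)[row_counts >= 3]
# [1] "AA" "CC"

# Counting unique individuals per population does not.
unique_counts <- tapply(df_dup$individual, df_dup$population,
                        function(x) length(unique(x)))
names(unique_counts)[unique_counts >= 3]
# [1] "CC"
```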


A base R solution that takes this logic into account is:

uniqueIndividuals <- ave(as.character(df$individual), 
                         df$population, FUN = function(x) length(unique(x)))
df[which(as.numeric(uniqueIndividuals) >= 3), ]
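Since the question asks for a function with an adjustable cutoff, the same idea could be wrapped up roughly as follows. This is only a sketch: the function name keep_large_populations, its arguments min_n, group_col, and id_col, and the example_df object are made up for illustration.

```r
# Hypothetical helper: keep the rows of every group that has at least
# `min_n` unique values in `id_col`.
keep_large_populations <- function(data, min_n,
                                   group_col = "population",
                                   id_col = "individual") {
  # For each row, the number of unique individuals in that row's group.
  # ave() returns character here because its first argument is character.
  n_unique <- ave(as.character(data[[id_col]]),
                  data[[group_col]],
                  FUN = function(x) length(unique(x)))
  data[as.numeric(n_unique) >= min_n, , drop = FALSE]
}

# Example data matching the question.
example_df <- data.frame(
  population = c("AA", "AA", "AA", "BB", "BB", "CC", "CC", "CC"),
  individual = c("A1", "A2", "A3", "B1", "B2", "C1", "C2", "C3"),
  Haplotype1 = rep(1:4, 2),
  Haplotype2 = rep(5:8, 2),
  stringsAsFactors = FALSE
)

# Keeps the AA and CC rows and drops BB.
keep_large_populations(example_df, min_n = 3)
```

Changing min_n is then all that is needed to rerun the analysis with a different cutoff.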

Upvotes: 3

Sven Hohenstein

Reputation: 81743

This should do the trick:

tab <- table(df$population) > 2
df[df$population %in% names(tab)[tab], ]

#   population individual Haplotype1 Haplotype2
# 1         AA         A1          1          5
# 2         AA         A2          2          6
# 3         AA         A3          3          7
# 6         CC         C1          2          6
# 7         CC         C2          3          7
# 8         CC         C3          4          8

Upvotes: 4
