Reputation: 727
I've searched for an answer to this question for awhile and haven't found this exact test case. Please accept my apologies if it's been solved elsewhere.
I have a large dataframe data
of biological observations (~2,500,000 rows, ~50 columns) from a government survey program. The important columns are reg
(region), species
, and year
(of the survey).
I would like to subset the dataframe to contain only the species that were found at least once per year, in each region. (I also need to do this for species found at least 10 times per year, but I assume the solution is the same.)
In other words, I need to filter out only the rows for which a species
is present in every year that a region
is also present - essentially, all the rows where unique(year)
for each species
matches unique(year)
for each reg
. (There are only nine regions, so I could certainly subset the data into nine dataframes and repeat this filtering nine times over, but I'm hoping there is a more elegant solution using only dplyr
that I can apply directly to data
.)
Thank you!
EDIT: I created a mock dataset as suggested by @aichao.
> year = c(1999,2000,2000,2004,2004,2008)
> reg = c('ai','ci','ci','ai','ai','ci')
> species = c('blueei','greenei','yellowei','blueei','greenei','yellowei')
> df <- data.frame(year, reg, species)
> df
year reg species
1 1999 ai blueei
2 2000 ci greenei
3 2000 ci yellowei
4 2004 ai blueei
5 2004 ai greenei
6 2008 ci yellowei
So in this df, I want to keep only the species that were found every time the region was surveyed (let's pretend this is the complete data). From ai
, I want to keep blueei
- which was found in every survey year - but not greenei
. Similarly, in ci
, I want to keep yellowei
but not greenei
, since greenei
did not turn up in 2008.
Upvotes: 2
Views: 1354
Reputation: 578
You can create an id
variable in your original data frame representing the region + species pair.
year = c(1999,2000,2000,2004,2004,2008)
reg = c('ai','ci','ci','ai','ai','ci')
species = c('blueei','greenei','yellowei','blueei','greenei','yellowei')
df <- data.frame(year, reg, species) %>%
mutate(ids = paste(reg, species, sep='-'))
Then create a new data frame that identifies, for each region
, the total number of year
s, how many year
s each species
was present, and which species were present in all years.
df2 <- df %>%
group_by(reg) %>%
mutate(n_yrs = length(unique(year))) %>%
group_by(reg, ids) %>%
summarize(present_yrs = length(unique(year)),
all_yrs = mean(n_yrs, na.rm = T)) %>%
filter(present_yrs == all_yrs)
Lastly, filter your old dataset to only include the id
's in df2
filter(df, ids %in% df2$ids)
Upvotes: 1
Reputation: 7445
to keep only the species that were found every time the region was surveyed
We can do the following using dplyr
:
library(dplyr)
result <- df %>% group_by(reg) %>% mutate(num.years.in.reg = length(unique(year))) %>%
group_by(reg,species) %>% filter(length(unique(year)) == first(num.years.in.reg)) %>%
select(-num.years.in.reg) %>% arrange(reg)
Notes:
group_by
reg
and create a column num.years.in.reg
with the number of unique
years that region is surveyed.group_by
both reg
and species
and keep only those species
for which the number of unique
years that the species
is surveyed in the region is equal to the number of unique
years that region is surveyed. Here, first
just retrieves the first value from num.years.in.reg
since they will be the same for all rows in the group.num.years.in.reg
column and sort the result
by reg
.With the data you posted, the result
is:
print(result)
##Source: local data frame [4 x 3]
##Groups: reg, species [2]
##
## year reg species
## <dbl> <chr> <chr>
##1 1999 ai blueei
##2 2004 ai blueei
##3 2000 ci yellowei
##4 2008 ci yellowei
Upvotes: 1