Reputation: 269
I'm working with some survey data (collected and logged by someone else) that is in a strange format. It records species abundance on survey transects, but it only lists the species observed during a given transect, and not all of the possible species recorded. I spent some time figuring out how to re-shape the data using tidyr so that I have columns for every species during each survey, and species not recorded are filled with 0's. Here's a short, reproducible example:
#This works:
Survey <- as.factor(c(rep("Survey 1",10),rep("Survey 2",10),rep("Survey 3",10)))
Species <- as.factor(c(c("A","B","C","D","E","U","V","W","X","Y"),c("A","C","E","G","I","K","M","O","Q","S"),c("B","D","F","H","J","L","N","P","R","T")))
Abundance <- ceiling(runif(30,1,50))
working.df<-cbind.data.frame(Survey,Species,Abundance)
working.spread<-working.df %>%
group_by(Survey) %>%
spread(Species,Abundance,drop=F,fill=0)
Unfortunately, the real data aren't this simple. In some cases they logged multiple lines of the same species within a single survey so that they could record information on an additional variable that I'm not interested in. I just care about the total abundance per survey. So this is an example of what the real data might look like (note the double "A" at the start of Species2):
#This doesn't work:
Species2 <- as.factor(c(c("A","A","C","D","E","U","V","W","X","Y"),c("A","C","E","G","I","K","M","O","Q","S"),c("B","D","F","H","J","L","N","P","R","T")))
not.working.df<-cbind.data.frame(Survey,Species2,Abundance)
not.working.spread<-not.working.df %>%
group_by(Survey) %>%
spread(Species2,Abundance,drop=F,fill=0)
So, when there are two of the same species listed, the spread argument no longer works, and returns the familiar error:
Error: Duplicate identifiers for rows (1, 2)
And in the true dataset I get an error with quite a few of these duplicates (and this is just one of several datasets), so I don't want to go through and fix this manually, of course:
Error: Duplicate identifiers for rows (206, 216), (1532, 1544), (1052, 1595), (1324, 1330), (191, 212), (194, 211), (1392, 1600), (19, 37), (1404, 1599), (199, 215), (1073, 1596), (1074, 1597), (43, 44, 45), (455, 456), (380, 381, 382, 383), (447, 448), (413, 414, 415, 416, 417, 418), (303, 304), (1015, 1016), (897, 898, 1593), (1306, 1307), (1041, 1594), (1076, 1598), (1425, 1426), (49, 64), (198, 214)
What I would like to do is sum the Abundance field across duplicate identifiers. I know there are similar questions on here and I've pored over many of them, but I haven't found a solution to this yet. I've worked hard on getting to this point with spread and it seems like I'm one simple function command away from getting this to work... any advice would be much appreciated. Or if I completely missed an existing answer to this problem, please point me in that direction.
Cheers
Upvotes: 0
Views: 848
Reputation: 269
Thanks, aosmith, for pointing me in the direction of the summarize thread—that did the trick. Here's the working solution:
not.working.spread<-not.working.df %>%
group_by(Survey,Species2) %>%
summarize(Abundance = sum(Abundance)) %>%
spread(Species2,Abundance,drop=F,fill=0)
Upvotes: 1