Reputation: 73
My question seems to be a very common question, but the solutions I found on internet don't work...
I would like to aggregate rows in a data frame in R. Here is the structure of my data frame (df), it is a table of citations :
Autors Lannoy_2016 Ramadier_2014 Lord_2009 Ortar_2008
Burgess E 1 NA NA NA
Burgess E 1 NA NA NA
Burgess E 1 NA NA NA
Burgess E 1 NA NA NA
Kaufmann V NA 1 NA NA
Kaufmann V NA NA 1 NA
Kaufmann V NA NA NA 1
Orfeuil P 1 NA NA NA
Orfeuil P NA 1 NA NA
Sorokin P NA NA NA 1
That is I would like to have :
Autors Lannoy_2016 Ramadier_2014 Lord_2009 Ortar_2008
Burgess E 4 NA NA NA
Kaufmann V NA 1 1 1
Orfeuil P 1 1 NA NA
Sorokin P NA NA NA 1
I have tried those solutions, but it doesn't work :
ddply(df,"Autors", numcolwise(sum))
and
df %>% group_by(Autors) %>% summarize_all(sum)
It aggregates well the rows, but the values (sum of the 1 values) are absolutely not correct ! And I don't understand why...
Do you have an idea ?
Thank you very much !
Joël
Upvotes: 1
Views: 2871
Reputation: 51
You can also do the summing using rowsum(), although it (perhaps misleadingly) gives sums of 0 rather than NA for cells in the output that had only NA's for input.
rowsum(df[,c(2:5)],df$Autors,na.rm=T)
Gives:
Lannoy_2016 Ramadier_2014 Lord_2009 Ortar_2008
Burgess E 4 0 0 0
Kaufmann V 0 1 1 1
Orfeuil P 1 1 0 0
Sorokin P 0 0 0 1
Upvotes: 3
Reputation: 887108
It could be because the na.rm
is not used
library(dplyr)
df %>%
group_by(Autors) %>%
summarize_all(sum, na.rm = TRUE)
if both plyr
and dplyr
are loaded, summarise
would get masked, but doubt about summarise_all
as it is a dplyr
function
Based on the expected output, with na.rm = TRUE
, it removes all NA
s and if there are cases having only NA
s it returns 0. To avoid that, we can have a condition
df %>%
group_by(Autors) %>%
summarize_all(funs(if(all(is.na(.))) NA else sum(., na.rm = TRUE)))
# A tibble: 4 x 5
# Autors Lannoy_2016 Ramadier_2014 Lord_2009 Ortar_2008
# <chr> <int> <int> <int> <int>
#1 Burgess E 4 NA NA NA
#2 Kaufmann V NA 1 1 1
#3 Orfeuil P 1 1 NA NA
#4 Sorokin P NA NA NA 1
df <- structure(list(Autors = c("Burgess E", "Burgess E", "Burgess E",
"Burgess E", "Kaufmann V", "Kaufmann V", "Kaufmann V", "Orfeuil P",
"Orfeuil P", "Sorokin P"), Lannoy_2016 = c(1L, 1L, 1L, 1L, NA,
NA, NA, 1L, NA, NA), Ramadier_2014 = c(NA, NA, NA, NA, 1L, NA,
NA, NA, 1L, NA), Lord_2009 = c(NA, NA, NA, NA, NA, 1L, NA, NA,
NA, NA), Ortar_2008 = c(NA, NA, NA, NA, NA, NA, 1L, NA, NA, 1L
)), .Names = c("Autors", "Lannoy_2016", "Ramadier_2014", "Lord_2009",
"Ortar_2008"), class = "data.frame", row.names = c(NA, -10L))
Upvotes: 1