Reputation: 451
I have a data.frame with many columns (~50). Some of them are character, some are numeric and 3 of them I use for grouping.
I need to:
Let's say, we're using modified iris data as below:
data(iris)
iris$year <- rep(c(2000,3000),each=25) ## for grouping
iris$color <- rep(c("red","green","blue"),each=50) ## character column
iris[1,] <- NA ## introducing NAs
I have ~50 columns in total, numeric and character mixed together. I've been trying something like:
giris <- group_by(iris, Species, year)
cls <- unlist(sapply(giris, class)) ## find out classes
action <- ifelse(cls == "numeric", "mean", "first")
action <- paste(action)
summarise_each(giris, action)
What I get is means for all columns in a group followed by columns with the first values in respective group. And NAs are not handled... Which is not exactly what I seek...
Help anyone?
Upvotes: 13
Views: 8390
Reputation: 70296
You could try this with an if
/else
in the funs
of summarise_each
:
iris %>%
group_by(Species, year) %>%
summarise_each(funs(if(is.numeric(.)) mean(., na.rm = TRUE) else first(.)))
Since you have some NA's also in grouping columns, you could add a filter
statement:
iris %>%
filter(!is.na(Species) & !is.na(year)) %>%
group_by(Species, year) %>%
summarise_each(funs(if(is.numeric(.)) mean(., na.rm = TRUE) else first(.)))
#Source: local data frame [6 x 7]
#Groups: Species [?]
#
# Species year Sepal.Length Sepal.Width Petal.Length Petal.Width color
# (fctr) (dbl) (dbl) (dbl) (dbl) (dbl) (chr)
#1 setosa 2000 5.025 3.479167 1.4625 0.250 red
#2 setosa 3000 4.984 3.376000 1.4640 0.244 red
#3 versicolor 2000 6.012 2.776000 4.3120 1.344 green
#4 versicolor 3000 5.860 2.764000 4.2080 1.308 green
#5 virginica 2000 6.576 2.928000 5.6400 2.044 blue
#6 virginica 3000 6.600 3.020000 5.4640 2.008 blue
To avoid potential NA's in the color column (or any non-numeric columns), you could modify it to first(na.omit(.))
.
You could also try data.table
:
library(data.table)
setDT(iris)
iris[!is.na(Species) & !is.na(year), lapply(.SD, function(x) {
if(is.numeric(x)) mean(x, na.rm = TRUE) else x[!is.na(x)][1L]}),
by = list(Species, year)]
# Species year Sepal.Length Sepal.Width Petal.Length Petal.Width color
#1: setosa 2000 5.025 3.479167 1.4625 0.250 red
#2: setosa 3000 4.984 3.376000 1.4640 0.244 red
#3: versicolor 2000 6.012 2.776000 4.3120 1.344 green
#4: versicolor 3000 5.860 2.764000 4.2080 1.308 green
#5: virginica 2000 6.576 2.928000 5.6400 2.044 blue
#6: virginica 3000 6.600 3.020000 5.4640 2.008 blue
Upvotes: 18
Reputation: 1327
I give it a try:
1. For the first point you mention, I would do something like the following (which isn't necessary for the second point):
na.omit(iris[ , which(sapply(iris, class) == "numeric")])
To separate the columns bei either numeric
or character
, I use the following:
iris[ , which(sapply(iris, class) == "numeric")]
iris[ , which(sapply(iris, class) == "character")]
2.
The second task I combine the above line with colMeans
:
colMeans(iris[ , which(sapply(iris, class) == "numeric")], na.rm = TRUE)
3. To extract the first element of the character columns, you could simply do:
iris[1, which(sapply(iris, class) == "character")]
In the mentioned case for the iris data, the first row is completly NA, even the character columns, so I would iterate to find the first-non-NA-row
k <- 1
while(any(is.na(FirstCharacterElement <- iris[k, which(sapply(iris, class) == "character")]))){
k <- k + 1
}
Be careful about the class factor (which break up the code in the case of the iris data where column Species is of class factor and you maybe expect it to be a character column. You can check this with sapply(iris, class)
and change it with e.g.
iris$Species <- as.character(iris$Species) #or with similar column names
When you read in the data you can mention the parameter stringsAsFactors = FALSE
of functions read.table
, read.csv
or similar.
Upvotes: 0