Reputation: 716
I'm dipping my toes into R
and am looking to do something that seems fairly straightforward, but I'm a little lost as to how to proceed. I'm reading a CSV that looks something like this:
translation,category,macrocategory,subcategory,1640,1641,1642,1643,1644
almonds,nuts,Flavoring/Other,,,491,,,
apples,Fruit,Fruits and Vegetables,42,,67,,,
Atlantic herring,Fish,Meat,,52,0,9,,
aurochs,Meat,Meat,game,,4,25,5,
bacon,Meat,Meat,pork,,275.87,78,92,0
barley groats,Grain,Grain,5,9,2,14,56,9
beef,Meat,Meat,Beef,,5.25,,,
What I'd like to have is to add new columns for the maximum, minimum, and mean for each food item (so, max, min, and mean for each row). I've handled some of the basic cleanup, but I'm a bit lost as to how to proceed from there.
library(dplyr)
library(tidyr)
df <- read.csv("foods.csv", sep=",", header = T)
food.clean <- data.frame(foodname=df[,1], data.matrix(df[,5:53]))
str(food.clean) # check
food.clean <- food.clean[rowSums(is.na(food.clean)) < 48, ]
summary(food.clean)
I also tried using the matrixStats
package, but ran into an error:
library(matrixStats)
food.matrix <- as.matrix(food.clean)
cbind(food.clean, mean=rowMeans(food.matrix), sd=rowSds(food.matrix), max=rowMaxs(food.matrix))
Error in rowMeans(food.matrix) : 'x' must be numeric
Any pointers? Thanks!
Upvotes: 0
Views: 2313
Reputation: 146
I guess that you would like to get the mean, min, and max for all numeric variables for each row.
If you have lots of numerical variables, it may be easier to tidy the data first as described in http://www.jstatsoft.org/v59/i10/
For instance, you can do the following.
library(dplyr)
library(tidyr)
df <- read.csv(text="
translation,category,macrocategory,subcategory,1640,1641,1642,1643,1644
almonds,nuts,Flavoring/Other,,,491,,,
apples,Fruit,Fruits and Vegetables,42,,67,,,
Atlantic herring,Fish,Meat,,52,0,9,,
aurochs,Meat,Meat,game,,4,25,5,
bacon,Meat,Meat,pork,,275.87,78,92,0
barley groats,Grain,Grain,5,9,2,14,56,9
beef,Meat,Meat,Beef,,5.25,,,"
)
info <-
df %>%
# tidy data
gather(variable, value, -(1:4)) %>%
# summarise by food item 'translation'
group_by(translation) %>%
summarise(
mean = mean(value, na.rm=TRUE),
min = min(value, na.rm=TRUE),
max = max(value, na.rm=TRUE)
)
This will give you
translation mean min max
1 almonds 491.00000 491.00 491.00
2 apples 67.00000 67.00 67.00
3 Atlantic herring 20.33333 0.00 52.00
4 aurochs 11.33333 4.00 25.00
5 bacon 111.46750 0.00 275.87
6 barley groats 18.00000 2.00 56.00
7 beef 5.25000 5.25 5.25
If you still need the original numerical variables, you can join this info
with df
.
df %>% left_join(info)
For more details, I would highly recommend reading up on tidy data paper and study dplyr
and tidyr
package.
Upvotes: 1
Reputation: 20463
It's not entirely clear what your sample data looks like (specifically, the 1640,1641,1642,1643,1644 in your header row), but this should show you one way of doing this with the mtcars
sample dataset. If you could dput
your actual data, it would be easier to help. Specifically, you could use dplyr::rowwise
:
library(dplyr)
df <- tbl_df(mtcars)
head(df)
# Source: local data frame [6 x 11]
#
# mpg cyl disp hp drat wt qsec vs am gear carb
# 1 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
# 2 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
# 3 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
# 4 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
# 5 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
# 6 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
df %>%
rowwise() %>%
summarise(min = min(am, gear, carb),
max = max(am, gear, carb),
mean = mean(am, gear, carb))
# min max mean
# 1 1 4 1
# 2 1 4 1
# 3 1 4 1
# 4 0 3 0
# 5 0 3 0
# 6 0 3 0
# 7 0 4 0
# 8 0 4 0
# 9 0 4 0
# 10 0 4 0
# .. ... ... ...
Upvotes: 0