Reputation: 845
I have a data frame similar to the built-in InsectSprays (with factor and numeric data), but it contains 10+ numeric and 20+ factor vectors with a few NAs. When I run boxplot(numeric ~ factor), I notice that some levels stand out, and I want to be able to compare each of them with the rest.
As an example: InsectSprays contains a numeric vector called count (ranging from 0 to 26) and a factor vector called spray with levels A, B, C, D, E and F. In InsectSprays, C is the lowest, so I want to be able to compare C with all the others.
I wrote a function for such numeric vectors:
num_interlevel <- function(df, variable, category){
  #find the levels of the categorizing parameter
  level.list <- levels(category)
  #build enough columns in the plot area
  par(mfrow=c(1,length(level.list)))
  for(i in 1:length(level.list)){
    #subset the df containing only the level in question
    variable.df <- na.omit(df[which(category == level.list[i]),])
    #subset the df containing all other levels
    category.df <- na.omit(df[which(category != level.list[i]),])
    boxplot(variable.df[, variable], category.df[, variable])
    p <- t.test(variable.df[, variable], category.df[, variable])$p.value
    title(paste(level.list[i], "=", p))
  }
}
and num_interlevel(InsectSprays, "count", InsectSprays$spray)
gives me the result I want.
But when it comes to comparing factor vectors with each other (I used tables for that), it doesn't work, simply because the data frames are of different sizes and, more importantly, because this is the wrong approach.
Then I thought that there might be an existing function for that, but I couldn't find one. Can anyone suggest a way or function to create one subset containing exactly one level and another subset containing all the other levels?
#dput:
structure(list(Yas = c(27, 18, 17, 18, 18), Cinsiyet = structure(c(2L,
2L, 2L, 1L, 1L), .Label = c("Erkek", "Kadın"), class = "factor"),
Ikamet = structure(c(5L, 4L, 3L, 3L, 5L), .Label = c("Aileyle",
"Akrabayla", "Arkadaşla", "Devlet yurdu", "Diğer", "Özel yurt",
"Tek başına"), class = "factor"), Aile_birey = c(13, 9, 6,
10, 6), Aile_gelir = c(700, 1000, 1500, 600, 800)), .Names = c("Yas",
"Cinsiyet", "Ikamet", "Aile_birey", "Aile_gelir"), row.names = c(NA,
5L), class = "data.frame")
I revised my functions after James's answer. This is certainly not an elegant solution, but I am putting it here for future reference:
n.compare <- function(df, variable, category){
  level.list <- levels(df[,category])
  par(mfrow=c(1,length(level.list)))
  for(i in 1:length(level.list)){
    boxplot(df[,variable] ~ (df[,category] == level.list[i]))
    p <- t.test(df[,variable] ~ (df[,category] == level.list[i]))$p.value
    title(paste(level.list[i], "=", p))
  }
}
f.compare <- function(df, variable, category){
  level.list <- levels(df[,category])
  par(mfrow=c(1,length(level.list)))
  for(i in 1:length(level.list)){
    print(paste(level.list[i]))
    print(table((df[,category] == level.list[i]), df[,variable]))
    writeLines("\n")
  }
}
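For the factor-versus-factor case, it may be handier to get the one-vs-rest tables back as a named list instead of printing them. The following is only a sketch of that idea: the helper name f_tables is hypothetical, and the built-in warpbreaks data set (factors wool and tension) is used as a stand-in because InsectSprays has only one factor column.

```r
# Hypothetical helper: one-vs-rest contingency tables for two factor columns.
# Returns a list with one table per level of `category`.
f_tables <- function(df, variable, category) {
  level.list <- levels(df[[category]])
  tabs <- lapply(level.list, function(lv) {
    # Rows: is the observation in level `lv`? Columns: levels of `variable`.
    table(df[[category]] == lv, df[[variable]],
          dnn = c(paste(category, "==", lv), variable))
  })
  setNames(tabs, level.list)
}

# Stand-in example: compare each tension level against the rest, by wool.
tabs <- f_tables(warpbreaks, "wool", "tension")
tabs$L
```

Each element of tabs is an ordinary table, so it can be inspected or passed on to further functions one level at a time.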
Upvotes: 0
Views: 1308
Reputation: 66834
To split up a data.frame, use split:
lapply(split(InsectSprays,InsectSprays$spray=="A"),summary)
$`FALSE`
     count       spray
 Min.   : 0.00   A: 0
 1st Qu.: 3.00   B:12
 Median : 5.00   C:12
 Mean   : 8.50   D:12
 3rd Qu.:13.25   E:12
 Max.   :26.00   F:12

$`TRUE`
     count       spray
 Min.   : 7.00   A:12
 1st Qu.:11.50   B: 0
 Median :14.00   C: 0
 Mean   :14.50   D: 0
 3rd Qu.:17.75   E: 0
 Max.   :23.00   F: 0
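The two pieces returned by split can also be pulled out by name, which gives the "one level" and "all other levels" subsets directly. A minimal sketch, using spray C as the example level; the variable names target, parts, only_target and others are just illustrative:

```r
# Split InsectSprays into rows where spray is "C" and rows where it is not.
target <- "C"
parts <- split(InsectSprays, InsectSprays$spray == target)

# split() names the pieces after the grouping values, here "FALSE" and "TRUE".
only_target <- parts[["TRUE"]]
others <- parts[["FALSE"]]

nrow(only_target)  # 12 rows for spray C
nrow(others)       # 60 rows for the remaining sprays
```

Note that rows where the factor is NA produce an NA in the comparison and so end up in neither piece.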
Also consider the following:
boxplot(count~(spray=="A"),InsectSprays)
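The same logical grouping can be looped over every level to collect the one-vs-rest p-values in one go. This is a rough sketch rather than a polished solution; spray_levels and p_values are illustrative names, and t.test is used here with its default (Welch) settings, as in the question:

```r
# One-vs-rest t-tests for every level of spray.
spray_levels <- levels(InsectSprays$spray)

p_values <- sapply(spray_levels, function(lv) {
  # TRUE for rows in the level of interest, FALSE for all other levels.
  is_level <- InsectSprays$spray == lv
  t.test(InsectSprays$count[is_level], InsectSprays$count[!is_level])$p.value
})

# Named numeric vector, one p-value per level.
round(p_values, 4)
```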
Upvotes: 2