Reputation: 11
I'm trying to use tapply to get the average weight of turtles caught per day. Every approach I've tried either throws an error or returns NA for every date value (class: POSIXct).
I've tried:
calling tapply on the weight column and date column -> arguments are different lengths error
removing records with NA values in the weight column of my dataframe then calling tapply on the weight column and date column. -> arguments are different lengths error
calling tapply on the na.omit call of the weight column and the date column indexed by the na.omit call of the weight column -> arguments are different lengths error
calling tapply on the na.omit call of the weight column and the factor-coerced date column indexed by the na.omit call of the weight column -> returns NA for every level of the factor-coerced date column
> head(stinkpotData)
Date DateCt Species Turtle.ID ID.Code Location Recapture Weight.g C.Length.mm
1 6/1/2001 2001-06-01 Stinkpot 1 1 keck lab dock site 0 190 95
2 6/1/2001 2001-06-01 Stinkpot 2 10 Right of dock 0 200 100
3 8/9/2001 2001-08-09 Stinkpot 2 10 #4 Deep Right of lab 1 175 104
4 8/27/2001 2001-08-27 Stinkpot 2 10 #4 Deep Right of lab 1 175 105
5 6/1/2001 2001-06-01 Stinkpot 3 11 Right of dock 0 200 109
6 10/3/2001 2001-10-03 Stinkpot 3 11 #4 Deep Right of lab 1 205 109
C.Width.mm Female.1.Male.2 Rotation Marks
1 70 <NA> <NA> <NA>
2 72 <NA> <NA> <NA>
3 72 2 <NA> Male
4 71 2 <NA> male, 1 small leech Right front leg
5 74 <NA> <NA> algae covered
6 76 2 <NA> male, 1 lg & 1 sm leech right rear leg
> head(noNAWeightsDf)
Date DateCt Species Turtle.ID ID.Code Location Recapture Weight.g C.Length.mm
1 6/1/2001 2001-06-01 Stinkpot 1 1 keck lab dock site 0 190 95
2 6/1/2001 2001-06-01 Stinkpot 2 10 Right of dock 0 200 100
3 8/9/2001 2001-08-09 Stinkpot 2 10 #4 Deep Right of lab 1 175 104
4 8/27/2001 2001-08-27 Stinkpot 2 10 #4 Deep Right of lab 1 175 105
5 6/1/2001 2001-06-01 Stinkpot 3 11 Right of dock 0 200 109
6 10/3/2001 2001-10-03 Stinkpot 3 11 #4 Deep Right of lab 1 205 109
C.Width.mm Female.1.Male.2 Rotation Marks
1 70 <NA> <NA> <NA>
2 72 <NA> <NA> <NA>
3 72 2 <NA> Male
4 71 2 <NA> male, 1 small leech Right front leg
5 74 <NA> <NA> algae covered
6 76 2 <NA> male, 1 lg & 1 sm leech right rear leg
> tapply(stinkpotData$Weight.g, stinkpotData$DateCt, FUN = mean)
Error in tapply(stinkpotData$Weight.g, stinkpotData$DateCt, FUN = mean) :
arguments must have same length
> tapply(noNAWeightsDf$Weight.g, noNAWeightsDf$DateCt, FUN = mean)
Error in tapply(noNAWeightsDf$Weight.g, noNAWeightsDf$DateCt, FUN = mean) :
arguments must have same length
> tapply(na.omit(stinkpotData$Weight.g), stinkpotData$DateCt[!is.na(stinkpotData$Weight.g)], FUN = mean)
Error in tapply(na.omit(stinkpotData$Weight.g), stinkpotData$DateCt[!is.na(stinkpotData$Weight.g)], :
arguments must have same length
> tapply(na.omit(stinkpotData$Weight.g), as.factor(stinkpotData$DateCt[!is.na(stinkpotData$Weight.g)]), FUN = mean)
2001-01-07 2001-06-01 2001-06-04 2001-06-06 2001-06-07 2001-06-11 2001-06-12 2001-06-15 2001-06-19
NA NA NA NA NA NA NA NA NA
2001-06-20 2001-06-25 2001-06-27 2001-06-29 2001-07-03 2001-07-09 2001-07-11 2001-07-13 2001-07-16
NA NA NA NA NA NA NA NA NA ................etc
There were 50 or more warnings (use warnings() to see the first 50)
> warnings()
Warning messages:
1: In mean.default(X[[i]], ...) :
argument is not numeric or logical: returning NA
2: In mean.default(X[[i]], ...) :
argument is not numeric or logical: returning NA
3: In mean.default(X[[i]], ...) :
argument is not numeric or logical: returning NA
.......................etc
EDIT:
split(na.omit(stinkpotData$Weight.g), as.factor(stinkpotData$DateCt[!is.na(stinkpotData$Weight.g)]))
This gave a list of the individual weights of turtles on each date. I verified that it was of mode list. Its elements were of mode numeric, class factor. Calling lapply on the split list with FUN=mean still returned NA for each level of date. I can get the means of individual elements of the split list if they are coerced to vectors, but that is not quite what I need.
EDIT 2: Finally got the result I wanted, but the steps to get there seem over-complicated and I still don't understand why using tapply won't work. I had to call split as in the first edit, then coerce each element of the resultant list to class numeric (originally returned as class factor) with lapply, then call mean on every element with lapply:
weightsDateList = split(na.omit(stinkpotData$Weight.g), as.factor(stinkpotData$DateCt[!is.na(stinkpotData$Weight.g)]))
weightsDateList = lapply(weightsDateList, FUN = as.numeric)
weightsDateList = lapply(weightsDateList, FUN = mean)
EDIT 3: I realize now that the result I get from the solution in EDIT 2 severely underestimates the means, so I'm still lost.
EDIT 4: Realized that converting weight to class numeric returned the number of the level of the weight from when it was a factor, which explains the severe underestimation of means.
I want the tapply call to return every date with turtle weight(s) and its respective average weight of turtles caught on those dates. Thanks and I apologize if I'm missing something easy.
Upvotes: 1
Views: 890
Reputation: 107567
Generally, to use tapply you must heed the following rules regarding its arguments:
First argument must be, or be castable to, a logical, integer, or numeric. Factors, characters, or other types cannot be used here.
Second argument must be, or be castable to, a factor, which covers any basic data type but not more complex types. This includes multiple groupings passed with list(), where tapply then returns a matrix. tapply likely applies as.factor() already under the hood. A more complex type is, for example, a class object of list type that contains equal-length atomic vectors, such as POSIXlt.
On argument lengths: NA maintains a length of one (unlike NULL), so its presence does not matter in tapply. However, the child function can have issues with NA, which tapply raises upstream.
Specifically, your issue regards the original types: the factor type of Weight.g and the POSIXlt type of DateCt. Consider converting these types to adhere to tapply.
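As a minimal sketch of these rules, assuming a small made-up data set (the vectors `w`, `day`, and `site` are hypothetical, not the OP's data): a numeric first argument, grouping vectors of the same length, a `list()` of two groupings that yields a matrix, and `na.rm = TRUE` forwarded to `mean` to handle an NA among the values.

```r
# Hypothetical data for illustration only (not the OP's data)
w    <- c(190, 200, NA, 175, 205)                 # numeric values, one NA
day  <- c("d1", "d1", "d2", "d2", "d2")           # grouping vector, same length as w
site <- c("dock", "lab", "dock", "lab", "lab")    # second grouping vector

# The NA does not break tapply itself, but mean() returns NA for that group
by_day <- tapply(w, day, mean)

# Extra arguments after FUN are passed on to it, so na.rm = TRUE reaches mean()
by_day_narm <- tapply(w, day, mean, na.rm = TRUE)

# Two groupings passed as a list give a matrix (days in rows, sites in columns)
by_day_site <- tapply(w, list(day, site), mean, na.rm = TRUE)

print(by_day)
print(by_day_narm)
print(by_day_site)
```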
But do not cast these original types directly, as the underlying numerics or factor level numbers will come through and cause undesirable results. For numeric conversion, cast first to character. For POSIXlt, cast to Date or character. The code below demonstrates this with the OP's dput of first ten rows, along with other grouping methods.
Data (only two relevant columns)
stinkpotDataDeparsed <- structure(list(Weight.g = structure(c(15L, 13L, 20L, 16L, 15L,
12L, NA, 12L, 15L, 20L, 26L), .Label = c("100", "105", "106",
"107", "110", "115", "1150", "120", "125", "126", "128", "130",
"135", "138", "140", "145", "150", "155", "159", "160", "165",
"168", "170", "175", "180", "185", "187", "190", "195", "20",
"200", "205", "210", "215", "220", "225", "230", "235", "245",
"250", "40", "45", "50", "55", "60", "65", "70", "75", "80",
"85", "90", "95", "oops!"), class = "factor"), DateCt = structure(list(
sec = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), min = c(0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), hour = c(0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), mday = c(20L, 30L, 8L, 29L,
23L, 26L, 12L, 17L, 29L, 13L, 4L), mon = c(8L, 8L, 10L, 10L,
5L, 5L, 6L, 6L, 6L, 5L, 5L), year = c(101L, 101L, 101L, 101L,
102L, 102L, 102L, 102L, 102L, 103L, 101L), wday = c(4L, 0L,
4L, 4L, 0L, 3L, 5L, 3L, 1L, 5L, 1L), yday = c(262L, 272L,
311L, 332L, 173L, 176L, 192L, 197L, 209L, 163L, 154L), isdst = c(0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), zone = c("EST",
"EST", "EST", "EST", "EST", "EST", "EST", "EST", "EST", "EST",
"EST"), gmtoff = c(NA_integer_, NA_integer_, NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,
NA_integer_, NA_integer_, NA_integer_)), .Names = c("sec",
"min", "hour", "mday", "mon", "year", "wday", "yday", "isdst",
"zone", "gmtoff"), class = c("POSIXlt", "POSIXt"), tzone = c("EST",
"EST", " "))), .Names = c("Weight.g", "DateCt"), row.names = 60:70, class = "data.frame")
Cleaning
# REMOVE NAs FROM DATA FRAME TO RUN ON ALL COLUMNS BUT DOES NOT MATTER W/ tapply
stinkpotDataDeparsed <- stinkpotDataDeparsed[!is.na(stinkpotDataDeparsed$Weight.g),]
# CAST FACTOR TYPE TO NUMERIC
stinkpotDataDeparsed$Weight.g <- as.numeric(as.character(stinkpotDataDeparsed$Weight.g))
# CAST POSIXlt TO DATE OR CHARACTER FOR FACTOR-ABILITY
stinkpotDataDeparsed$DateCt <- as.Date(stinkpotDataDeparsed$DateCt)
# stinkpotDataDeparsed$DateCt <- as.character(stinkpotDataDeparsed$DateCt)
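The two-step cast in the cleaning code matters because a factor stores integer level codes. A minimal sketch on a hypothetical factor `f` (made up for illustration, not taken from the OP's data) shows the difference between the direct cast and the cast through character:

```r
# Hypothetical factor of weights stored as text (for illustration only)
f <- factor(c("140", "135", "160"))

# Direct cast returns the internal level codes, not the weights
codes <- as.numeric(f)

# Casting to character first recovers the actual values
values <- as.numeric(as.character(f))

print(codes)
print(values)
```

With this cast, a value whose level is not numeric text (such as the "oops!" level present in the OP's factor) would become NA with a warning.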
Tapply (returns a vector)
with(stinkpotDataDeparsed, tapply(Weight.g, DateCt, mean))
# 2001-06-04 2001-09-20 2001-09-30 2001-11-08 2001-11-29 2002-06-23 2002-06-26 2002-07-17 2002-07-29 2003-06-13
# 185 140 135 160 145 140 130 130 140 160
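To see why the Date cast helps, a small check on hypothetical dates (the vectors `lt` and `w` are made up for illustration) shows that a POSIXlt object is a list internally, while its Date conversion is a plain atomic vector that tapply can group on:

```r
# Hypothetical dates and weights (for illustration only)
lt <- as.POSIXlt(c("2001-06-01", "2001-06-01", "2001-08-09"), tz = "UTC")
w  <- c(190, 200, 175)

print(is.list(lt))   # POSIXlt is stored as a list of component vectors
d <- as.Date(lt)
print(is.list(d))    # Date is an atomic vector, not a list

# Grouping on the Date vector gives one mean per distinct day
res <- tapply(w, d, mean)
print(res)
```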
Aggregate (returns a data frame)
aggregate(Weight.g ~ DateCt, data = stinkpotDataDeparsed, mean)
# DateCt Weight.g
# 1 2001-06-04 185
# 2 2001-09-20 140
# 3 2001-09-30 135
# 4 2001-11-08 160
# 5 2001-11-29 145
# 6 2002-06-23 140
# 7 2002-06-26 130
# 8 2002-07-17 130
# 9 2002-07-29 140
# 10 2003-06-13 160
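As a side note on the NA-removal step, the formula method of aggregate has a default of `na.action = na.omit`, so rows with a missing weight are dropped before the means are computed. A minimal sketch on a hypothetical data frame `df` (made up for illustration):

```r
# Hypothetical data frame with one missing weight (for illustration only)
df <- data.frame(
  w   = c(190, 200, NA, 175, 205),
  day = c("d1", "d1", "d2", "d2", "d2")
)

# The row with the NA weight is omitted by the default na.action
agg <- aggregate(w ~ day, data = df, FUN = mean)
print(agg)
```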
Ave (returns vector of same length as input, so can be assigned a data frame column)
stinkpotDataDeparsed$Wgt.Mean <- with(stinkpotDataDeparsed, ave(Weight.g, DateCt, FUN=mean))
stinkpotDataDeparsed
# Weight.g DateCt Wgt.Mean
# 60 140 2001-09-20 140
# 61 135 2001-09-30 135
# 62 160 2001-11-08 160
# 63 145 2001-11-29 145
# 64 140 2002-06-23 140
# 65 130 2002-06-26 130
# 67 130 2002-07-17 130
# 68 140 2002-07-29 140
# 69 160 2003-06-13 160
# 70 185 2001-06-04 185
By (object-oriented wrapper to tapply, returns a list)
by(stinkpotDataDeparsed, stinkpotDataDeparsed$DateCt, FUN=function(sub) mean(sub$Weight.g))
# stinkpotDataDeparsed$DateCt: 2001-06-04
# [1] 185
# ------------------------------------------------------------
# stinkpotDataDeparsed$DateCt: 2001-09-20
# [1] 140
# ------------------------------------------------------------
# stinkpotDataDeparsed$DateCt: 2001-09-30
# [1] 135
# ------------------------------------------------------------
# stinkpotDataDeparsed$DateCt: 2001-11-08
# [1] 160
# ------------------------------------------------------------
# stinkpotDataDeparsed$DateCt: 2001-11-29
# [1] 145
# ------------------------------------------------------------
# stinkpotDataDeparsed$DateCt: 2002-06-23
# [1] 140
# ------------------------------------------------------------
# stinkpotDataDeparsed$DateCt: 2002-06-26
# [1] 130
# ------------------------------------------------------------
# stinkpotDataDeparsed$DateCt: 2002-07-17
# [1] 130
# ------------------------------------------------------------
# stinkpotDataDeparsed$DateCt: 2002-07-29
# [1] 140
# ------------------------------------------------------------
# stinkpotDataDeparsed$DateCt: 2003-06-13
# [1] 160
Upvotes: 0