Reputation: 12074

Least square means

I'm having problems understanding the behaviour of least square means. Below is a toy example that uses a random data set to demonstrate my issue. The scenario is this: there are 10 stations that are sampled for a metric called density in either fall or spring between 1999 and 2015.

# Number of observations in data set
n.obs <- 1000

# Create dummy data set 
df.tst <- data.frame(density = runif(n.obs, 0, 1),
                 year = as.factor(round(runif(n.obs, 1999, 2015))),
                 season = sample(c("Fall", "Spring"), n.obs, replace= TRUE),
                 station = sample(paste("Stat", 1:10), n.obs, replace= TRUE))

Here, I have assigned sampling randomly to purposely represent a patchy data set. Since the data set is patchy, I use least square means so that I can estimate density for a season and year while trying to avoid effects of sampling bias.

# Fit linear model to data
fitted.model <- lm(density ~ year + season + station, data=df.tst)

# Calculate least square means
seasonal.model <- summary(lsmeans(fitted.model, c("year", "season")))

To compare results, I create a metric that I call "anomaly" that is the value for a particular year/season minus the mean for all years (to centre values) and normalised by the standard deviation. The idea is that this provides some standardised metric of how much, say, spring of 2005 varies from other spring observations. So...

# Anomaly for spring
seasonal.model %>% 
  filter(season == "Spring") %>% 
  mutate(anom = (lsmean - mean(lsmean))/sd(lsmean))

Which is great. What I don't understand is why when I do this for each season, I get the same answer. For example...

# Anomaly for fall
seasonal.model %>% 
  filter(season == "Fall") %>% 
  mutate(anom = (lsmean - mean(lsmean))/sd(lsmean))

These give,

   year season    lsmean         SE  df  lower.CL  upper.CL         anom
1  1999 Spring 0.5966423 0.05242108 973 0.4937709 0.6995137  1.879970679
2  2000 Spring 0.4510249 0.03717688 973 0.3780688 0.5239810 -1.617190892
3  2001 Spring 0.4683691 0.03713725 973 0.3954908 0.5412474 -1.200649943
4  2002 Spring 0.4451014 0.03730148 973 0.3719008 0.5183020 -1.759450098
5  2003 Spring 0.5035975 0.03881258 973 0.4274315 0.5797635 -0.354600778
6  2004 Spring 0.5263538 0.03505567 973 0.4575604 0.5951472  0.191916022
7  2005 Spring 0.5553849 0.03703036 973 0.4827163 0.6280535  0.889129265
8  2006 Spring 0.5182714 0.03626301 973 0.4471087 0.5894341 -0.002191364
9  2007 Spring 0.4765210 0.04097960 973 0.3961024 0.5569395 -1.004874623
10 2008 Spring 0.5465877 0.04483499 973 0.4586033 0.6345721  0.677854169
11 2009 Spring 0.4959347 0.03185768 973 0.4334170 0.5584523 -0.538633207
12 2010 Spring 0.5359122 0.03934735 973 0.4586968 0.6131276  0.421471255
13 2011 Spring 0.5533493 0.03461044 973 0.4854296 0.6212690  0.840241309
14 2012 Spring 0.5465454 0.03697864 973 0.4739783 0.6191124  0.676838641
15 2013 Spring 0.5594985 0.03436268 973 0.4920650 0.6269320  0.987923047
16 2014 Spring 0.5320895 0.03825361 973 0.4570205 0.6071586  0.329666086
17 2015 Spring 0.5009818 0.04771979 973 0.4073363 0.5946274 -0.417419568

and

   year season    lsmean         SE  df  lower.CL  upper.CL         anom
1  1999   Fall 0.5683228 0.05226823 973 0.4657514 0.6708943  1.879970679
2  2000   Fall 0.4227054 0.03638423 973 0.3513048 0.4941060 -1.617190892
3  2001   Fall 0.4400496 0.03752816 973 0.3664042 0.5136951 -1.200649943
4  2002   Fall 0.4167819 0.03741172 973 0.3433649 0.4901989 -1.759450098
5  2003   Fall 0.4752781 0.04039258 973 0.3960115 0.5545447 -0.354600778
6  2004   Fall 0.4980343 0.03492573 973 0.4294959 0.5665728  0.191916022
7  2005   Fall 0.5270654 0.03591814 973 0.4565795 0.5975514  0.889129265
8  2006   Fall 0.4899519 0.03618696 973 0.4189385 0.5609654 -0.002191364
9  2007   Fall 0.4482015 0.04026627 973 0.3691828 0.5272202 -1.004874623
10 2008   Fall 0.5182682 0.04545050 973 0.4290759 0.6074605  0.677854169
11 2009   Fall 0.4676152 0.03210064 973 0.4046207 0.5306096 -0.538633207
12 2010   Fall 0.5075927 0.03880628 973 0.4314391 0.5837464  0.421471255
13 2011   Fall 0.5250298 0.03440547 973 0.4575123 0.5925473  0.840241309
14 2012   Fall 0.5182259 0.03755452 973 0.4445287 0.5919231  0.676838641
15 2013   Fall 0.5311791 0.03463023 973 0.4632205 0.5991376  0.987923047
16 2014   Fall 0.5037701 0.03826525 973 0.4286782 0.5788620  0.329666086
17 2015   Fall 0.4726624 0.04793952 973 0.3785856 0.5667391 -0.417419568

I expect I am misunderstanding something very simple, but why should spring and fall anomalies be exactly the same for a given year even though the actual least square means are actually different. Any insights would be appreciated...

EDIT: In response to bethanyP. This is from another run, so may differ a smidge since the data are random.

Classes ‘summary.ref.grid’ and 'data.frame':    34 obs. of  7 variables:
 $ year    : Factor w/ 17 levels "1999","2000",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ season  : Factor w/ 2 levels "Fall","Spring": 1 1 1 1 1 1 1 1 1 1 ...
 $ lsmean  : num  0.484 0.461 0.441 0.495 0.575 ...
 $ SE      : num  0.046 0.0361 0.0355 0.0408 0.0347 ...
 $ df      : num  973 973 973 973 973 973 973 973 973 973 ...
 $ lower.CL: num  0.394 0.39 0.371 0.415 0.507 ...
 $ upper.CL: num  0.574 0.532 0.511 0.575 0.643 ...
 - attr(*, "estName")= chr "lsmean"
 - attr(*, "clNames")= chr  "lower.CL" "upper.CL"
 - attr(*, "pri.vars")= chr  "year" "season"
 - attr(*, "mesg")= chr  "Results are averaged over the levels of: station" "Confidence level used: 0.95"

Upvotes: 3

Answers (2)

Russ Lenth

Reputation: 6760

The reason is because you fitted an additive model, implying that you assume the year effects are the same in each season. In other words, the relationship among the predictions is exactly the same each season.

Least-squares means are based solely on model predictions -- so the model matters! If you fit a model with interactions, then you will get different anomalies each season.

Upvotes: 2

sconfluentus

Reputation: 4993

Try this, I think it might not be filtering because you gave it a string, but season is a factor:

seasonal.model %>% 
  filter(season == as.factor("Spring")) %>% 
  mutate(anom = (lsmean - mean(lsmean))/sd(lsmean))

If that is so, this should work you can set: spring = as.factor("Spring") then just feed spring into your pipes.

Try this instead:seasonal.model %>% group_by(year, season) %>% summarize(anom = (lsmean - mean(lsmean))/sd(lsmean))

Upvotes: 1

Least square means

Answers (2)

Related Questions