Reputation: 135
I'm running a hurdle-type analysis on species distribution data, which involves two fitting steps. The first step is to model presence/absence using the full dataset with family=quasibinomial (m1). The second step is to model positive (presence-only) abundance with family=Gamma (m2). This works wonderfully until I try to predict from the second model (m2) on the full dataset, at which point I receive an error about new factor levels. I understand why I am getting this error: there are factor levels in the full dataset that are not present in the reduced (presence-only) dataset. My question is how to work around this error so that I can get predictions from the second model on the full set. I am using mgcv.
Edit: Updated with additional code and data.
library(mgcv)

# Step 1 - GAM using the full dataset for presence/absence
grays<-structure(list(Grid_ID = structure(c(39L, 51L, 52L, 67L), .Label = c("1",
"1,000", "1,001", "1,008", "1,009", "1,010", "1,011", "1,012",
"1,013", "1,014", "1,015", "1,016", "1,022", "1,023", "1,024",
"1,025", "1,026", "1,027", "1,028", "1,029", "1,034", "1,035",
"1,036", "1,037", "1,039", "1,040", "1,045", "1,046", "1,047",
"1,048", "1,053", "1,054", "1,055", "10", "100", "101", "103",
"104", "105", "106", "107", "108", "109", "11", "110", "118",
"119", "12", "122", "125", "126", "127", "128", "129", "13",
"130", "131", "132", "133", "14", "141", "142", "15", "150",
"151", "152", "153", "154", "155", "156", "157", "158", "159",
"160", "161", "162", "163", "167", "168", "169", "173", "174",
"175", "176", "177", "178", "179", "180", "181", "182", "183",
"184", "185", "188", "189", "190", "196", "197", "198", "199",
"2", "20", "200", "201", "202", "203", "204", "205", "206", "207",
"209", "210", "211", "219", "22", "220", "221", "222", "223",
"224", "225", "226", "227", "228", "229", "23", "230", "231",
"233", "234", "235", "236", "237", "24", "246", "247", "248",
"249", "25", "250", "252", "253", "254", "255", "256", "257",
"258", "259", "26", "260", "261", "267", "268", "269", "27",
"270", "271", "272", "273", "274", "275", "276", "277", "278",
"279", "28", "280", "281", "286", "287", "288", "289", "29",
"290", "291", "292", "293", "294", "295", "296", "297", "298",
"299", "3", "300", "301", "302", "303", "305", "306", "307",
"308", "309", "310", "311", "312", "313", "314", "315", "316",
"317", "318", "319", "320", "321", "326", "327", "328", "329",
"330", "331", "332", "333", "334", "335", "336", "337", "339",
"340", "341", "343", "344", "345", "346", "347", "348", "349",
"350", "351", "352", "355", "356", "357", "36", "360", "361",
"362", "363", "364", "365", "366", "367", "368", "369", "37",
"372", "373", "374", "38", "380", "381", "382", "383", "384",
"385", "386", "39", "391", "392", "397", "398", "399", "4", "40",
"400", "401", "402", "408", "409", "41", "410", "412", "413",
"414", "415", "416", "417", "42", "423", "424", "425", "426",
"43", "430", "431", "432", "433", "434", "44", "441", "442",
"443", "444", "447", "448", "449", "45", "450", "451", "458",
"459", "46", "460", "461", "462", "463", "464", "465", "466",
"470", "471", "472", "473", "474", "475", "476", "484", "485",
"486", "487", "488", "489", "490", "491", "492", "496", "497",
"498", "499", "5", "500", "501", "513", "514", "515", "516",
"517", "518", "523", "524", "525", "526", "527", "528", "529",
"54", "541", "542", "543", "544", "545", "55", "550", "551",
"552", "553", "554", "56", "569", "57", "570", "571", "572",
"573", "574", "578", "579", "580", "581", "582", "599", "60",
"600", "601", "602", "603", "604", "605", "606", "607", "608",
"609", "61", "610", "62", "626", "627", "628", "629", "63", "632",
"633", "634", "635", "636", "637", "638", "639", "64", "653",
"654", "655", "656", "657", "658", "659", "660", "663", "664",
"665", "666", "667", "668", "669", "670", "671", "672", "673",
"687", "688", "689", "690", "691", "692", "693", "696", "697",
"698", "699", "7", "700", "701", "702", "703", "704", "705",
"716", "717", "718", "720", "721", "722", "723", "724", "725",
"726", "727", "728", "739", "74", "740", "741", "746", "747",
"748", "749", "75", "750", "751", "752", "753", "754", "764",
"765", "768", "769", "77", "770", "771", "772", "773", "78",
"782", "783", "784", "788", "789", "79", "790", "798", "799",
"8", "80", "800", "801", "804", "805", "81", "812", "813", "814",
"815", "816", "819", "82", "820", "821", "827", "828", "829",
"83", "830", "831", "833", "834", "835", "836", "84", "842",
"843", "844", "845", "846", "849", "85", "850", "851", "852",
"853", "854", "860", "861", "862", "863", "864", "869", "870",
"871", "872", "873", "874", "88", "881", "882", "883", "884",
"885", "886", "89", "890", "891", "892", "893", "894", "9", "902",
"903", "904", "905", "906", "908", "909", "910", "911", "912",
"922", "923", "924", "925", "926", "927", "928", "929", "930",
"940", "941", "942", "943", "944", "945", "946", "947", "948",
"957", "958", "959", "96", "960", "961", "962", "963", "964",
"965", "966", "97", "976", "977", "978", "979", "980", "981",
"982", "983", "984", "992", "993", "994", "995", "996", "997",
"998", "999"), class = "factor"), Grid_Lat = c(56.85582097, 56.90062505,
56.90024495, 56.94461032), Grid_Long = c(153.4783612, 153.4777153,
153.3954873, 153.3124098), Er_Pres = c(0L, 0L, 0L, 0L), Er_Count = c(0L,
0L, 0L, 0L), Er_Count_Density = c(0, 0, 0, 0), Month = structure(c(8L,
8L, 8L, 8L), .Label = c("1", "2", "3", "4", "5", "6", "7", "8",
"9", "10", "11"), class = "factor"), Year = structure(c(1L, 1L,
1L, 1L), .Label = c("1997", "1998", "1999", "2000", "2001", "2002",
"2003", "2004", "2005", "2006", "2007", "2008", "2009", "2010",
"2011", "2012", "2013"), class = "factor"), chl = c(0.53747,
0.53747, 0.53747, 0.581741), SST = c(13.4171, 13.4171, 13.4171,
13.4025002), Bathymetry = c(76.11354065, 92.14147949, 90.60312653,
71.55316162), Grid_Area = c(25, 25, 25, 25), DFS = c(6.807817092,
4.233185446, 9.199096676, 5.153224038), Slope = c(0.13670446,
0.38316911, 0.08646853, 0.20038579), DOY = c(244L, 244L, 244L,
244L)), .Names = c("Grid_ID", "Grid_Lat", "Grid_Long", "Er_Pres",
"Er_Count", "Er_Count_Density", "Month", "Year", "chl", "SST",
"Bathymetry", "Grid_Area", "DFS", "Slope", "DOY"), row.names = c(NA,
4L), class = "data.frame")
m1 <- gam(Er_Pres ~ s(Grid_Lat, Grid_Long, k = 10, bs = "tp") + Month + Year + s(SST, k = 5, bs = "tp"), family = quasibinomial(link = "logit"), data = grays, gamma = 1.4, offset = Grid_Area)
# Step 2 - reduce the dataset and run a second GAM on positive abundance only
grays2 <- subset(grays, Er_Pres > 0)
m2 <- gam(Er_Count ~ Year + s(Grid_Lat, Grid_Long, k = 10, bs = "tp") + s(SST, k = 5, bs = "tp") + s(sqrt(DFS), k = 5, bs = "tp") + Month + log10(chl), family = Gamma(link = "log"), data = grays2, gamma = 1.4, offset = Grid_Area)
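The break happens at the prediction step, where I take m2 back to the complete dataset (full in the call below is that full dataset; the call is reproduced from the error message):

# Predict positive abundance from m2 over the complete dataset
pred <- predict(m2, newdata = full, type = "response")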
This gives me the following error:
Error in predict.gam(m2, newdata = full, type = "response") :
1997, 1998, 2006, 2007 not in original fit
Upvotes: 5
Views: 1815
Reputation: 2180
In your particular case, years can easily be handled by changing them to integers. This is probably a superior solution to treating them as factors because it captures the information that 1997 follows 1996 but is before 1998, etc.
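A minimal sketch of that conversion, reusing your m2 formula (the Year_num name is just illustrative):

# Convert the Year factor to a number so there are no factor levels to mismatch;
# go through as.character() first, because as.numeric() on a factor returns level codes
grays$Year_num <- as.numeric(as.character(grays$Year))
grays2 <- subset(grays, Er_Pres > 0)

m2 <- gam(Er_Count ~ Year_num + s(Grid_Lat, Grid_Long, k = 10, bs = "tp") +
            s(SST, k = 5, bs = "tp") + s(sqrt(DFS), k = 5, bs = "tp") +
            Month + log10(chl),
          family = Gamma(link = "log"), data = grays2, gamma = 1.4, offset = Grid_Area)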
However, for a more general solution to factors not found in a second test set, what I do is re-encode the original problematic factor so that levels with too few cases are lumped together into a generic "other" category. This is easily accomplished with forcats::fct_lump. That way the model fitted on the training subset already knows how to handle the infrequent "other" category when it shows up in the testing subset.
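For example, a sketch with your Year factor (the n = 5 threshold is purely illustrative):

library(forcats)

# Lump the rarest years into a generic "Other" level *before* subsetting,
# so any level the positives-only model meets at predict time also existed at fit time
grays$Year <- fct_lump(grays$Year, n = 5)  # keep the 5 most common years
grays2 <- subset(grays, Er_Pres > 0)

Each retained level still has to occur at least once in the positives subset for the refit m2 to know about it.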
Upvotes: 0
Reputation: 1
This is an old post, so I suspect you have found a solution by now, but if not, consider this:
If you only want to account for data within the same year being more similar than data across years, and you are not necessarily interested in the effect of particular years (say, the difference between 2007 and 1998), then you could specify year as a random effect.
I believe there are several ways to do this, but in mgcv, you can specify:
s(Year, bs="re")
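A sketch of how that could look here, refitting m2 with the random effect and then zeroing it out at prediction time via predict.gam's exclude argument (the placeholder mapping for unseen years is only so the model matrix can be built; the excluded term contributes nothing):

library(mgcv)

# Refit m2 with Year as a random effect (grays2 is the positives-only subset)
m2re <- gam(Er_Count ~ s(Year, bs = "re") +
              s(Grid_Lat, Grid_Long, k = 10, bs = "tp") +
              s(SST, k = 5, bs = "tp") + s(sqrt(DFS), k = 5, bs = "tp") +
              Month + log10(chl),
            family = Gamma(link = "log"), data = grays2, gamma = 1.4)

# Map years unseen at fit time to an arbitrary in-sample level, purely as a
# placeholder, then predict with the random effect excluded (population-level prediction)
seen <- unique(as.character(grays2$Year))
full$Year <- as.character(full$Year)
full$Year[!full$Year %in% seen] <- seen[1]
full$Year <- factor(full$Year, levels = seen)
pred <- predict(m2re, newdata = full, type = "response", exclude = "s(Year)")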
Upvotes: 0