fleems

Reputation: 109

Removing observations based on leverage values

After inspecting my data, there seem to be quite a few larger-than-expected (or larger-than-hoped) leverage values in my lm() model. I want to remove the observations whose leverage exceeds the threshold.

df <- df[df$title_year >= 2000, ]
model <- lm(gross ~ ., data = df)   # use the column name, not df$gross, when combining with "."

lev <- hatvalues(model)
lev_val <- lev[lev > 2 * 8 / 2546]  # 2p/n rule of thumb

> names(lev_val)
  [1] "282"  "90"   "103"  "25"   "189"  "53"   "147"  "51"   "1526" "1"    "133"  "34"   "273"  "1852" "2172" "5"    "916"  "200"  "79"   "50"  
 [21] "29"   "632"  "266"  "149"  "28"   "26"   "107"  "105"  "202"  "144"  "203"  "8"    "414"  "54"   "279"  "394"  "4"    "155"  "10"   "1539"
 [41] "27"   "1318" "153"  "109"  "1790" "1327" "918"  "1275" "1055" "85"   "11"   "1543" "919"  "33"   "481"  "759"  "43"   "2404" "30"   "920" 
 [61] "212"  "123"  "42"   "223"  "58"   "118"  "111"  "32"   "281"  "88"   "1075" "1061" "421"  "1517" "487"  "2084" "774"  "934"  "1069" "86"  
 [81] "113"  "221"  "37"   "60"   "112"  "304"  "1347" "117"  "697"  "102"  "1472" "225"  "773"  "2"    "219"  "121"  "151"  "1856" "18"   "122" 
[101] "182"  "518"  "1084" "2014" "776"  "300"  "71"   "125"  "2070" "1854" "97"   "1288" "38"   "1087" "2004" "227"  "1747" "64"   "98"   "264" 
[121] "23"   "1729" "150"  "65"   "1559" "36"   "52"   "13"   "128"  "328"  "39"   "524"  "886"  "613"  "2195" "2441" "2446" "2096" "84"   "1522"
[141] "15"   "1800" "462"  "41"   "1100" "1804" "14"   "162"  "1336" "232"  "2193" "24"   "529"  "2088" "787"  "68"   "433"  "785"  "66"   "1340"
[161] "1863" "1587" "788"  "950"  "2450" "1717" "158"  "682"  "2531" "951"  "1110" "535"  "539"  "1478" "35"   "72"   "2136" "547"  "1887" "21"  
[181] "2208" "1873" "1126" "2144" "1805" "2211" "1722" "1874" "1721" "91"   "16"   "2451" "410"  "31"   "1153" "1593" "280"  "1878" "366"  "20"  
[201] "2407" "69"   "1605" "1606" "694"  "1881" "403"  "73"   "303"  "1149" "955"  "1144" "893"  "1138" "2221" "140"  "404"  "819"  "1365" "2225"
[221] "1374" "1690" "1485" "442"  "1150" "1614" "1486" "78"   "1894" "1998" "185"  "2236" "2229" "1373" "1155" "1369" "1817" "184"  "1305" "9"   
[241] "2017" "331"  "1910" "466"  "1621" "566"  "1619" "1372" "2239" "570"  "1908" "947"  "1906" "587"  "170"  "1629" "17"   "308"  "2112" "2257"
[261] "400"  "575"  "1166" "402"  "1745" "1733" "1818" "372"  "962"  "329"  "829"  "966"  "67"   "1174" "2248" "703"  "1622" "1624" "1623" "574" 
[281] "2255" "269"  "1013" "1180" "1387" "1627" "40"   "1924" "2270" "2260" "1631" "95"   "1628" "1184" "2423" "2050" "1481" "1388" "1037" "2060"
[301] "843"  "1746" "1202" "975"  "977"  "1392" "256"  "2109" "2541" "333"  "1495" "1047" "2426" "2478" "2277" "250"  "2282" "981"  "92"   "848" 
[321] "2290" "75"   "1995" "1646" "2143" "688"  "2123" "2488" "2499" "255"  "1941" "1190" "914"  "1189" "2487" "1911" "1768" "382"  "2305" "1512"
[341] "2113" "2127" "2329" "980"  "1203" "1204" "582"  "1949" "2185" "600"  "588"  "1422" "1468" "1963" "2425" "2331" "2494" "2122" "1273" "2415"
[361] "2147" "2335" "2500" "2276" "864"  "2397" "2504" "1235" "1269" "2361" "2360" "2119" "1009" "2371" "2362" "709"  "1976" "1216" "2039" "594" 
[381] "595"  "2517" "77"   "2124" "2373" "1680" "1682" "998"  "259"  "1984" "2386" "1785" "1683" "143"  "1990" "1179" "2163" "2418" "82"

I am not sure how to remove these high-leverage observations from my data.

I've tried df2 = df[which(names(df) %nin% remove), ], but I've had no success with that or minor variations. It removes more rows than there are over-threshold leverage values, probably because it removes matching values rather than the X'th observation.

Does anyone have basic R code for this? I seem to be lost.

Upvotes: 1

Views: 2635

Answers (1)

Zheyuan Li

Reputation: 73315

Purely as a programming issue, you want df[!(rownames(df) %in% remove), ], as base R has no %nin% (that operator comes from the Hmisc package). Also note that you want row.names, not names, because you want to select rows, not columns. Or, why not simply use df[lev <= (2 * 8 / 2546), ]?
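
For example, a minimal sketch of the logical-subsetting route (this assumes, as in your post, the 2p/n threshold with p = 8 and n = 2546, and that lm() dropped no rows for missing values, so lev lines up with df row for row):

threshold <- 2 * 8 / 2546           # 2p/n rule of thumb from the question
lev <- hatvalues(model)             # one hat value per row used in the fit
df_clean <- df[lev <= threshold, ]  # keep only rows at or below the threshold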

In your specific context, the first option will give you the wrong result. You have subsetted and overwritten df, after which its row.names are no longer 1, 2, 3, ....

Statistically, high leverage alone does not imply an outlier. It is the combination of high leverage and a large residual that is problematic; Cook's distance is a good measure of this.
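
For example, a sketch using cooks.distance() with the common 4/n rule of thumb (the cutoff is a convention, not a hard rule, so adjust it for your data):

cd <- cooks.distance(model)     # influence of each observation on the fit
cutoff <- 4 / nrow(df)          # a common rule-of-thumb cutoff
df_clean <- df[cd <= cutoff, ]  # again assumes lm() dropped no rows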

Upvotes: 2
