Reputation: 2832
I'm trying to scrape this table using R. So far, I've managed to get only 27 lines of it, using the code below. I would like to get all the entries back and, ideally, modify the request so that I can select certain years etc. Other questions on SO target slightly different situations, and I would like to keep this in the rvest-xml2-httr world, if possible.
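library(magrittr)  # provides the %>% pipe used below

url <- "http://myfwc.com/wildlifehabitats/managed/alligator/harvest/data-export/"
# Fetch the page and pull out the hidden __VIEWSTATE token that must be sent back
view <- httr::POST(url) %>%
  xml2::read_html() %>%
  rvest::html_nodes("input[name='__VIEWSTATE']") %>%
  rvest::html_attr("value")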
param <- list(`__EVENTTARGET` = "",
`__EVENTARGUMENT` = "",
`__VIEWSTATE` = view,
`ctl00$ctl00$ctl00$ctl00$ctl00$ContentPlaceHolderDefault$pageBody$pageBody$rightColumn$ctl01$AlligatorHarvestExport_6$RadGrid1$ctl00$ctl02$ctl00$RefreshButton` = "",
`ctl00$ctl00$ctl00$ctl00$ctl00$ContentPlaceHolderDefault$pageBody$pageBody$rightColumn$ctl01$AlligatorHarvestExport_6$RadGrid1$ctl00$ctl02$ctl03$FilterTextBox_Year` = "",
`ctl00$ctl00$ctl00$ctl00$ctl00$ContentPlaceHolderDefault$pageBody$pageBody$rightColumn$ctl01$AlligatorHarvestExport_6$RadGrid1$ctl00$ctl02$ctl03$FilterTextBox_AreaNumber` = "",
`ctl00$ctl00$ctl00$ctl00$ctl00$ContentPlaceHolderDefault$pageBody$pageBody$rightColumn$ctl01$AlligatorHarvestExport_6$RadGrid1$ctl00$ctl02$ctl03$FilterTextBox_AreaName` = "",
`ctl00$ctl00$ctl00$ctl00$ctl00$ContentPlaceHolderDefault$pageBody$pageBody$rightColumn$ctl01$AlligatorHarvestExport_6$RadGrid1$ctl00$ctl03$ctl01$PageSizeComboBox` = "10000",
`ctl00_ctl00_ctl00_ctl00_ctl00_ContentPlaceHolderDefault_pageBody_pageBody_rightColumn_ctl01_AlligatorHarvestExport_6_RadGrid1_ctl00_ctl03_ctl01_PageSizeComboBox_ClientState` = "",
`ctl00_ctl00_ctl00_ctl00_ctl00_ContentPlaceHolderDefault_pageBody_pageBody_rightColumn_ctl01_AlligatorHarvestExport_6_RadGrid1_rfltMenu_ClientState` = "",
`ctl00_ctl00_ctl00_ctl00_ctl00_ContentPlaceHolderDefault_pageBody_pageBody_rightColumn_ctl01_AlligatorHarvestExport_6_RadGrid1_ClientState` = "",
`__VIEWSTATEGENERATOR` = "CA0B0334")
request <- httr::POST(url,
                      body = param,
                      encode = "form") %>%
  xml2::read_html() %>%
  rvest::html_table(fill = TRUE)
tib <- request[[1]]
> dim(tib)
[1] 27 9
Upvotes: 0
Views: 729
Reputation: 160607
The table in question has an "Export to CSV" link.
If you click on it, you get the 6.36 MB CSV file directly, which is good. I'm assuming that you need/want to do this programmatically, so this worked for me:
Open the browser's developer tools on the Network tab and click the link. Then right-click on the "POST" line and select "Copy POST Data"; this provides:
__EVENTTARGET
__EVENTARGUMENT
__VIEWSTATE=...
ctl00$ctl00$ctl00$ctl00$ctl00$ContentPlaceHolderDefault$pageBody$pageBody$rightColumn$ctl01$AlligatorHarvestExport_6$RadGrid1$ctl00$ctl02$ctl00$ExportToCsvButton=+
ctl00$ctl00$ctl00$ctl00$ctl00$ContentPlaceHolderDefault$pageBody$pageBody$rightColumn$ctl01$AlligatorHarvestExport_6$RadGrid1$ctl00$ctl02$ctl03$FilterTextBox_Year
ctl00$ctl00$ctl00$ctl00$ctl00$ContentPlaceHolderDefault$pageBody$pageBody$rightColumn$ctl01$AlligatorHarvestExport_6$RadGrid1$ctl00$ctl02$ctl03$FilterTextBox_AreaNumber
ctl00$ctl00$ctl00$ctl00$ctl00$ContentPlaceHolderDefault$pageBody$pageBody$rightColumn$ctl01$AlligatorHarvestExport_6$RadGrid1$ctl00$ctl02$ctl03$FilterTextBox_AreaName
ctl00$ctl00$ctl00$ctl00$ctl00$ContentPlaceHolderDefault$pageBody$pageBody$rightColumn$ctl01$AlligatorHarvestExport_6$RadGrid1$ctl00$ctl03$ctl01$PageSizeComboBox=20
ctl00_ctl00_ctl00_ctl00_ctl00_ContentPlaceHolderDefault_pageBody_pageBody_rightColumn_ctl01_AlligatorHarvestExport_6_RadGrid1_ctl00_ctl03_ctl01_PageSizeComboBox_ClientState
ctl00_ctl00_ctl00_ctl00_ctl00_ContentPlaceHolderDefault_pageBody_pageBody_rightColumn_ctl01_AlligatorHarvestExport_6_RadGrid1_rfltMenu_ClientState
ctl00_ctl00_ctl00_ctl00_ctl00_ContentPlaceHolderDefault_pageBody_pageBody_rightColumn_ctl01_AlligatorHarvestExport_6_RadGrid1_ClientState
__VIEWSTATEGENERATOR=CA0B0334
(I replaced the long base64-string with "...".) The notable line is the fourth, ending in $ExportToCsvButton=+. This is the parameter you need to include in your POST data (param).
Using your code above up through and including defining param, continue with:
param$`ctl00$ctl00$ctl00$ctl00$ctl00$ContentPlaceHolderDefault$pageBody$pageBody$rightColumn$ctl01$AlligatorHarvestExport_6$RadGrid1$ctl00$ctl02$ctl00$ExportToCsvButton` <- "+"
request <- httr::POST(url, body = param, encode = 'form')
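Since every grid control name shares the same long prefix, a small helper can make these parameter names easier to read and less prone to copy-paste slips. This is only a sketch: `grid_prefix`, `grid_field()` and `demo_param` are names made up for illustration, and the prefix is copied from the POST data above, so it may well change if the site is rebuilt.

```r
# Hypothetical helper: build the long ASP.NET control names from a shared prefix.
# The prefix is copied verbatim from the captured POST data.
grid_prefix <- paste0(
  "ctl00$ctl00$ctl00$ctl00$ctl00$ContentPlaceHolderDefault$pageBody$pageBody$",
  "rightColumn$ctl01$AlligatorHarvestExport_6$RadGrid1$ctl00$"
)
grid_field <- function(suffix) paste0(grid_prefix, suffix)

# Stand-in for the real `param` list, so the sketch runs on its own
demo_param <- list(`__EVENTTARGET` = "", `__EVENTARGUMENT` = "")

# Same assignment as above, with the control name assembled by the helper
demo_param[[grid_field("ctl02$ctl00$ExportToCsvButton")]] <- "+"
names(demo_param)
```

With the real list, the same call would be `param[[grid_field("ctl02$ctl00$ExportToCsvButton")]] <- "+"`.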
You'll now have:
request
# Response [http://myfwc.com/wildlifehabitats/managed/alligator/harvest/data-export/]
# Date: 2017-06-01 18:09
# Status: 200
# Content-Type: text/csv; charset-UTF-8;
# Size: 6.36 MB
# <U+FEFF>"Year","Area Number","Area Name","Carcass Size","Harvest Date","Location"
# "2000","101","LAKE PIERCE","11 ft. 5 in.","09-22-2000",""
# "2000","101","LAKE PIERCE","9 ft. 0 in.","10-02-2000",""
# "2000","101","LAKE PIERCE","8 ft. 10 in.","10-06-2000",""
# "2000","101","LAKE PIERCE","8 ft. 0 in.","09-25-2000",""
# "2000","101","LAKE PIERCE","8 ft. 0 in.","10-07-2000",""
# "2000","101","LAKE PIERCE","8 ft. 0 in.","09-22-2000",""
# "2000","101","LAKE PIERCE","7 ft. 2 in.","09-21-2000",""
# "2000","101","LAKE PIERCE","7 ft. 1 in.","09-21-2000",""
# "2000","101","LAKE PIERCE","6 ft. 11 in.","09-25-2000",""
# ...
Side note: the website starts the file with <U+FEFF>, a Unicode byte-order mark (BOM). This throws off read.csv and gives you a column name of X.U.FEFF.Year, but the effect is entirely cosmetic.
If you don't care about the suggested filename, you can simply do
write(as.character(request), file="quux.csv")
If you want to use the filename the website suggests for it, you can find it with:
httr::headers(request)$`content-disposition`
# [1] "inline;filename=\"FWCAlligatorHarvestData.csv\""
Parsing that should be straightforward.
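As a minimal sketch of that parsing step, assuming the header always has the `filename=...` form shown above (other servers may use different quoting or the `filename*=` variant, which this does not handle), a single regular expression is enough. The variable names `cd` and `fname` are just for illustration.

```r
# Header value as returned above; hard-coded here so the sketch runs offline
cd <- "inline;filename=\"FWCAlligatorHarvestData.csv\""

# Capture whatever follows filename=, with or without surrounding quotes
fname <- sub('.*filename="?([^";]+)"?.*', "\\1", cd)
fname
# [1] "FWCAlligatorHarvestData.csv"
```

With the live response you would start from `cd <- httr::headers(request)$`content-disposition`` and then pass `fname` as the `file` argument when writing.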
If you don't want/need to save to an intermediate file, you can always consume it immediately:
head(read.csv(textConnection(as.character(request))))
# Invalid encoding : defaulting to UTF-8.
# X.U.FEFF.Year Area.Number Area.Name Carcass.Size Harvest.Date Location
# 1 2000 101 LAKE PIERCE 11 ft. 5 in. 09-22-2000
# 2 2000 101 LAKE PIERCE 9 ft. 0 in. 10-02-2000
# 3 2000 101 LAKE PIERCE 8 ft. 10 in. 10-06-2000
# 4 2000 101 LAKE PIERCE 8 ft. 0 in. 09-25-2000
# 5 2000 101 LAKE PIERCE 8 ft. 0 in. 10-07-2000
# 6 2000 101 LAKE PIERCE 8 ft. 0 in. 09-22-2000
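If the mangled first column name bothers you, one option is to strip a leading U+FEFF from the text before parsing it. The following is a sketch on a tiny made-up stand-in for `as.character(request)`, not on the real download; `txt`, `clean` and `dat` are illustrative names, and behaviour may differ on setups where the locale cannot represent the character.

```r
# Small stand-in for as.character(request): a BOM followed by CSV text
txt <- "\ufeff\"Year\",\"Area Number\"\n\"2000\",\"101\"\n"

# Drop a leading U+FEFF, if present, before handing the text to read.csv
clean <- sub("^\ufeff", "", txt)

dat <- read.csv(text = clean)
names(dat)
# [1] "Year"        "Area.Number"
```

Applied to the real response, the same idea would be `read.csv(text = sub("^\ufeff", "", as.character(request)))`.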
Upvotes: 3