Reputation: 117
I am trying to parse an html page that is displayed in csv format.
I would like to read this table on the fly of the page directly into R (http://www.nwrfc.noaa.gov/esp/esp2csv.cgi?id=slkw1) and then convert the data into a matrix for further crunching.
Not sure if RCurl or XML packages are the best way to do it but have not been able to find a good way to parse the object once I get it into R.
Upvotes: 1
Views: 860
Reputation: 263382
> con=url("http://www.nwrfc.noaa.gov/esp/esp2csv.cgi?id=slkw1")
> table(count.fields(con, sep=","))
1 25521
2 1
So there are two lines with one field only ... turns out that skipping two lines lets the data read succeed.
> con=url("http://www.nwrfc.noaa.gov/esp/esp2csv.cgi?id=slkw1")
> temp <- scan(file=con, skip=2, what="a", sep=",")
# paged through the scan results
> table(grepl("<br />", temp))
FALSE TRUE
25478 43
> con=url("http://www.nwrfc.noaa.gov/esp/esp2csv.cgi?id=slkw1")
> temp <- readLines(con=con)
> temp2=gsub("<br />", "\n", temp)
# Knew that the first two lines were HTML junk.
> temp3 = read.table( text=temp2[3], sep= ",", fill =TRUE, header=TRUE)
> str(temp3)
'data.frame': 43 obs. of 581 variables:
$ X.pre.STA.ID: Factor w/ 1 level "SLKW1": 1 1 1 1 1 1 1 1 1 1 ...
$ START.YEAR : int 1949 1950 1951 1952 1953 1954 1955 1956 1957 1958 ...
$ X9.1 : int -999 -999 -999 -999 -999 -999 -999 -999 -999 -999 ...
$ X9.2 : int -999 -999 -999 -999 -999 -999 -999 -999 -999 -999 ...
$ X9.3 : int -999 -999 -999 -999 -999 -999 -999 -999 -999 -999 ...
$ X9.4 : int -999 -999 -999 -999 -999 -999 -999 -999 -999 -999 ...
$ X9.5 : int -999 -999 -999 -999 -999 -999 -999 -999 -999 -999 ...
$ X9.6 : int -999 -999 -999 -999 -999 -999 -999 -999 -999 -999 ...
snipped output There is better data below
$ X9.18 : int 41 41 41 41 41 41 41 41 41 41 ...
$ X9.19 : int 41 41 41 41 41 41 41 41 41 41 ...
$ X9.20 : int 40 40 40 40 40 40 40 40 40 40 ...
$ X9.21 : int 39 39 39 39 39 39 39 39 39 39 ...
$ X9.22 : int 39 39 39 39 39 39 39 39 39 39 ...
$ X9.23 : int 39 39 39 39 39 39 39 39 39 39 ...
$ X9.24 : int 38 38 38 38 38 38 38 38 38 38 ...
$ X9.25 : int 38 38 38 38 38 38 38 38 38 38 ...
$ X9.26 : int 38 38 38 38 38 38 38 38 38 38 ...
$ X9.27 : int 38 38 38 38 38 38 38 38 38 38 ...
$ X9.28 : int 48 53 106 49 89 48 94 117 49 49 ...
$ X9.29 : int 57 45 436 40 403 38 145 294 40 40 ...
$ X9.30 : int 259 34 1270 37 1622 33 80 432 37 37 ...
The last column gets coerced to either factor or character depenging on the read.table setting because the last element in the last line is
> temp3[43, 581]
[1] "-999</pre></body>"
Several ways of handling that. Pick one. Doesn't really matter much since the entire column is -999 anyway.
Upvotes: 2
Reputation: 32986
I got this far. If someone can go further, please edit my answer.
library(RCurl)
library(XML)
library(stringr)
# grab the content. can't use readHTMLTable since this isn't one.
htmldata <- getURL('http://www.nwrfc.noaa.gov/esp/esp2csv.cgi?id=slkw1')
# grab data between the pre tags
dataonly <- substr(htmldata, str_locate(data, "<pre>")[2]+1 ,
str_locate(data, "</pre>")[1]-1)
# replace <br> tags with new lines.
dataonly2 <- str_replace(dataonly, "<br />", "\n")
Upvotes: 0