austin
austin

Reputation: 117

Parsing a CSV HTML stream in R

I am trying to parse an html page that is displayed in csv format.

I would like to read this table on the fly of the page directly into R (http://www.nwrfc.noaa.gov/esp/esp2csv.cgi?id=slkw1) and then convert the data into a matrix for further crunching.

Not sure if RCurl or XML packages are the best way to do it but have not been able to find a good way to parse the object once I get it into R.

Upvotes: 1

Views: 860

Answers (2)

IRTFM
IRTFM

Reputation: 263382

> con=url("http://www.nwrfc.noaa.gov/esp/esp2csv.cgi?id=slkw1")
> table(count.fields(con, sep=","))

    1 25521 
    2     1 

So there are two lines with one field only ... turns out that skipping two lines lets the data read succeed.

> con=url("http://www.nwrfc.noaa.gov/esp/esp2csv.cgi?id=slkw1")
> temp <- scan(file=con,  skip=2, what="a", sep=",") 
# paged through the scan results
> table(grepl("<br />", temp))

FALSE  TRUE 
25478    43

> con=url("http://www.nwrfc.noaa.gov/esp/esp2csv.cgi?id=slkw1")
> temp <- readLines(con=con)
> temp2=gsub("<br />", "\n", temp)
# Knew that the first two lines were HTML junk.
> temp3 = read.table( text=temp2[3], sep= ",", fill =TRUE, header=TRUE)
> str(temp3)
'data.frame':   43 obs. of  581 variables:
 $ X.pre.STA.ID: Factor w/ 1 level "SLKW1": 1 1 1 1 1 1 1 1 1 1 ...
 $ START.YEAR  : int  1949 1950 1951 1952 1953 1954 1955 1956 1957 1958 ...
 $ X9.1        : int  -999 -999 -999 -999 -999 -999 -999 -999 -999 -999 ...
 $ X9.2        : int  -999 -999 -999 -999 -999 -999 -999 -999 -999 -999 ...
 $ X9.3        : int  -999 -999 -999 -999 -999 -999 -999 -999 -999 -999 ...
 $ X9.4        : int  -999 -999 -999 -999 -999 -999 -999 -999 -999 -999 ...
 $ X9.5        : int  -999 -999 -999 -999 -999 -999 -999 -999 -999 -999 ...
 $ X9.6        : int  -999 -999 -999 -999 -999 -999 -999 -999 -999 -999 ...

snipped  output  There is better data below

 $ X9.18       : int  41 41 41 41 41 41 41 41 41 41 ...
 $ X9.19       : int  41 41 41 41 41 41 41 41 41 41 ...
 $ X9.20       : int  40 40 40 40 40 40 40 40 40 40 ...
 $ X9.21       : int  39 39 39 39 39 39 39 39 39 39 ...
 $ X9.22       : int  39 39 39 39 39 39 39 39 39 39 ...
 $ X9.23       : int  39 39 39 39 39 39 39 39 39 39 ...
 $ X9.24       : int  38 38 38 38 38 38 38 38 38 38 ...
 $ X9.25       : int  38 38 38 38 38 38 38 38 38 38 ...
 $ X9.26       : int  38 38 38 38 38 38 38 38 38 38 ...
 $ X9.27       : int  38 38 38 38 38 38 38 38 38 38 ...
 $ X9.28       : int  48 53 106 49 89 48 94 117 49 49 ...
 $ X9.29       : int  57 45 436 40 403 38 145 294 40 40 ...
 $ X9.30       : int  259 34 1270 37 1622 33 80 432 37 37 ...

The last column gets coerced to either factor or character depenging on the read.table setting because the last element in the last line is

>  temp3[43, 581]
[1] "-999</pre></body>"

Several ways of handling that. Pick one. Doesn't really matter much since the entire column is -999 anyway.

Upvotes: 2

Maiasaura
Maiasaura

Reputation: 32986

I got this far. If someone can go further, please edit my answer.

library(RCurl)
library(XML)
library(stringr)
# grab the content. can't use readHTMLTable since this isn't one.
htmldata <- getURL('http://www.nwrfc.noaa.gov/esp/esp2csv.cgi?id=slkw1')
# grab data between the pre tags
dataonly <- substr(htmldata, str_locate(data, "<pre>")[2]+1 ,
str_locate(data, "</pre>")[1]-1)
# replace <br> tags with new lines.
dataonly2 <- str_replace(dataonly, "<br />", "\n")

Upvotes: 0

Related Questions