Reputation: 13
I am trying to use readHTMLTable to read
https://ows.doleta.gov/unemploy/trigger/2002/trig_101302.html
(after downloading it to a .html file with download.file(FullURL, filename1); I don't know how to upload the file here).
The resulting table is missing many rows, and a lot of states have disappeared. The resulting list has only 40 rows, while the HTML file contains 53 states plus a lot of non-state rows. I tried the header argument, but it did not help.
Any suggestions for getting a full table with all the states in it would be much appreciated.
Upvotes: 1
Views: 121
Reputation: 6784
Once you have got past the https problem with readHTMLTable by downloading the page, this worked for me. Everything is a character string and the dates start with E or B, but it at least gives you the information and you can start cleaning:
library(XML)
doc <- "C:/Users/ME/Desktop/trig_101302.html"
tab <- readHTMLTable(doc, stringsAsFactors=FALSE)[[1]][6:58,]
tab
gives the table shown at the end of this answer.
With the second URL you mention, the number of rows at the top changes and one of the columns seems to disappear (and something happens to Puerto Rico), so something more general
library(XML)
doc <- "C:/Users/ME/Desktop/trig_123117.html"
tab <- readHTMLTable(doc, stringsAsFactors=FALSE)[[1]]
tab <- tab[which(tab[[4]]=="Alabama"):which(tab[[4]]=="Wyoming"), ]
might work instead. The remaining risk is that the state names move out of the fourth column in some future page.
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12
6   & Alabama  2.17 118 5.6 105 121  E 06-04-1983
7    Alaska  3.63 109 6.9 109 106  E 06-01-2002
8   & Arizona  1.97 134 5.9 128 151  E 10-23-1982
9   & Arkansas  2.87 117 5.1 98 115  E 03-26-1983
10   & California  3.38 137 6.4 120 128  E 07-09-1983
11   & Colorado  1.69 182 5.1 141 182  E 01-24-1981
12    Connecticut  3.08 150 3.8 111 180  E 01-24-1981
13  * & Delaware  2.02 133 4.0 121 97  E 07-17-1982
14   & District of Col  1.62 122 6.0 89 107  E 01-24-1981
15  * & Florida  1.88 131 5.3 112 147  E 01-24-1981
16  * & Georgia  1.81 139 4.7 117 127  E 01-24-1981
17   & Hawaii  1.93 107 4.1 93 100  E 01-24-1981
18   & Idaho  2.31 125 5.3 108 110  E 06-22-2002
19   & Illinois  2.80 141 6.4 118 148  E 06-25-1983
20   & Indiana  1.79 135 5.1 115 154  E 04-30-1983
21  * & Iowa  1.75 138 3.8 115 146  E 06-04-1983
22    Kansas  2.04 160 4.5 104 118  E 11-06-1982
23  * & Kentucky  2.13 125 5.2 92 126  E 03-19-1983
24   & Louisiana  1.93 132 5.9 103 105  E 03-14-1987
25   & Maine  1.53 119 4.1 100 120  E 06-25-1994
26   & Maryland  2.00 139 4.2 102 105  E 07-31-1982
27  * & Massachusetts  3.20 137 5.0 131 192  E 06-29-1991
28   & Michigan  2.79 134 6.5 122 180  E 06-15-1991
29   & Minnesota  1.81 145 4.2 113 127  E 06-18-1983
30   & Mississippi  2.47 118 6.4 120 110  E 07-16-1983
31   & Missouri  2.40 126 5.1 108 150  E 06-19-1982
32   & Montana  1.67 105 4.4 97 89  E 06-04-1983
33   & Nebraska  1.32 149 3.5 112 116  E 01-24-1981
34   & Nevada  2.58 123 5.3 103 135  E 03-19-1983
35  *  New \r\n Hampshire  1.45 165 4.5 121 155  E 01-24-1981
36   & New Jersey  3.40 130 5.4 128 145  E 06-19-1982
37   & New Mexico  1.94 139 6.2 131 129  E 11-27-1982
38   & New York  2.75 132 6.0 125 133  E 01-24-1981
39 @   North Carolina  2.58 138 6.6 117 183 13 B 06-02-2002
40  * & North Dakota  0.96 121 3.3 117 110  E 06-11-1983
41   & Ohio  1.98 130 5.6 130 136  E 05-14-1983
42   & Oklahoma  1.70 154 4.3 110 138  E 01-24-1981
43 @   Oregon  3.68 127 7.2 112 150 13 B 01-06-2002
44   & Pennsylvania  3.44 130 5.4 112 128  E 08-06-1983
45   & Puerto Rico  5.76 109 12.7 111 125  E 01-01-2000
46    Rhode Island  2.95 107 4.4 91 107  E 07-08-1995
47   & South Carolina  2.56 121 5.4 96 142  E 03-19-1983
48  * & South Dakota  0.70 137 2.8 82 127  E 01-24-1981
49   & Tennessee  2.07 108 4.8 106 123  E 09-25-1982
50   & Texas  2.15 154 6.0 122 142  E 01-24-1981
51  * & Utah  1.77 158 4.9 113 153  E 06-25-1983
52    Vermont  2.30 164 4.0 111 137  E 07-13-1991
53   & Virgin Islands  2.84 240 5.6 186 280  E 08-06-1983
54   & Virginia  1.48 175 4.0 114 181  E 01-24-1981
55 @ *  Washington  3.61 133 7.0 109 134 13 B 01-06-2002
56   & West \r\n Virginia  2.31 126 6.2 126 112  E 07-13-1991
57   & Wisconsin  2.66 133 5.0 108 135  E 06-18-1983
58  * & Wyoming  1.19 143 3.9 97 100  E 06-13-1987
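One way to reduce the risk mentioned earlier (the state names moving out of the fourth column) is to look the column up instead of hard-coding it. This is only a minimal sketch in base R, run on a small made-up data frame rather than the real page; `squish`, `find_state_col` and `trim_to_states` are hypothetical helper names, not functions from the XML package.

```r
# Collapse runs of whitespace (e.g. "New \r\n Hampshire") and trim the ends
squish <- function(x) gsub("\\s+", " ", trimws(x))

# Hypothetical helper: index of the first column containing a given state name
find_state_col <- function(tab, state = "Alabama") {
  hits <- which(vapply(tab,
                       function(col) any(squish(col) == state, na.rm = TRUE),
                       logical(1)))
  if (length(hits) == 0) stop("no column contains ", state)
  hits[[1]]
}

# Hypothetical helper: keep only the rows from `first` to `last`
trim_to_states <- function(tab, first = "Alabama", last = "Wyoming") {
  col <- find_state_col(tab, first)
  state_names <- squish(tab[[col]])
  start <- which(state_names == first)[1]
  end <- which(state_names == last)[1]
  if (is.na(end)) stop("no row contains ", last)
  tab[start:end, , drop = FALSE]
}

# Made-up stand-in for the output of readHTMLTable(...)[[1]]
toy <- data.frame(
  V1 = c("header", "", "", "", "footer"),
  V2 = c("State", "Alabama", "New \r\n Hampshire", "Wyoming", "notes"),
  V3 = c("IUR", "2.17", "1.45", "1.19", ""),
  stringsAsFactors = FALSE
)

res <- trim_to_states(toy)
res$V2 <- squish(res$V2)
print(res)
```

On the real table the call would be `trim_to_states(tab)`; it still assumes that Alabama is the first state row and Wyoming the last.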
Upvotes: 0
Reputation: 43344
That table is pretty appallingly structured inside, so you're going to need to do some work to extract it into a proper data frame. Using rvest for scraping and base R for cleanup (there are lots of helpful alternative packages, if you prefer):
library(rvest)
# scrape HTML
h <- read_html('https://ows.doleta.gov/unemploy/trigger/2002/trig_101302.html')
df <- h %>%
    html_node('table') %>%         # select <table> HTML node
    html_table(fill = TRUE) %>%    # extract table from HTML to data frame
    head(57)                       # omit end matter
# fix names to the point of R legality
names(df) <- make.names(gsub('\\s+', '.',
                             sapply(head(df, 4),
                                    paste, collapse = '.')),
                        unique = TRUE)
df <- df[-1:-4, ]    # remove rows with names
df[] <- lapply(df, type.convert, as.is = TRUE) # coerce to appropriate types
str(df)
#> 'data.frame': 53 obs. of 12 variables:
#> $ ... : chr "" "" "" "" ...
#> $ ....1 : chr "" "" "" "" ...
#> $ ....2 : chr "&" "" "&" "&" ...
#> $ ....3 : chr "Alabama" "Alaska" "Arizona" "Arkansas" ...
#> $ INDICATORS... : logi NA NA NA NA NA NA ...
#> $ INDICATORS.13WeekIUR.. : num 2.17 3.63 1.97 2.87 3.38 1.69 3.08 2.02 1.62 1.88 ...
#> $ INDICATORS.Pct.ofPrior2.Yrs.. : int 118 109 134 117 137 182 150 133 122 131 ...
#> $ INDICATORS.3.moSATUR.. : num 5.6 6.9 5.9 5.1 6.4 5.1 3.8 4 6 5.3 ...
#> $ INDICATORS.Pct.of.prior.Year. : int 105 109 128 98 120 141 111 121 89 112 ...
#> $ INDICATORS.Pct.of.prior.2ndYear. : int 121 106 151 115 128 182 180 97 107 147 ...
#> $ INDICATORS.Pct.of.prior.Avail.WKS. : int NA NA NA NA NA NA NA NA NA NA ...
#> $ STATUS.Periods.Begin.Date.B.End.Date.E..: chr "E 06-04-1983" "E 06-01-2002" "E 10-23-1982" "E 03-26-1983" ...
It's still not great, but it can be cleaned up more easily from this point.
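As one example of that further cleanup, the status column could be split into its B/E flag and a proper Date. This is a minimal sketch in base R on three sample values in the shape shown in the str() output; it assumes the dates are month-day-year, and the variable names are purely illustrative.

```r
# Sample values in the same shape as the STATUS column
status <- c("E 06-04-1983", "B 06-02-2002", "E 01-24-1981")

# Split each value on whitespace into the flag and the date text
parts <- strsplit(trimws(status), "\\s+")
flag <- vapply(parts, `[`, character(1), 1)       # "B" (begin) or "E" (end)
date_txt <- vapply(parts, `[`, character(1), 2)   # e.g. "06-04-1983"

# Parse the date text, assuming month-day-year
status_date <- as.Date(date_txt, format = "%m-%d-%Y")

cleaned <- data.frame(flag = flag, date = status_date, stringsAsFactors = FALSE)
print(cleaned)
```

With the scraped data, `status` would presumably be the last column, `df[[12]]`; any rows that do not match this pattern would come out as NA and need checking by hand.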
Upvotes: 1