user9951600

Reputation: 13

readHTMLTable lost rows

I am trying to use readHTMLTable to read

https://ows.doleta.gov/unemploy/trigger/2002/trig_101302.html

(I first downloaded it to an .html file, but I don't know how to upload that file here. The download code is download.file(FullURL, filename1).)

The resulting table is missing many rows, and a lot of states have disappeared. For example, the resulting list has only 40 rows, while the HTML file contains 53 state rows plus many non-state rows. I tried the header argument, but it did not help.

Any suggestions for getting a full table with all states in it would be much appreciated.

Upvotes: 1

Views: 121

Answers (2)

Henry

Reputation: 6784

Once you have got over the https problem with readHTMLTable by downloading the page, this worked for me. Everything comes back as a character string and the dates are prefixed with E or B, but it at least gives you the information, and you can start cleaning from there.

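library(XML)
# Path to the locally saved copy of the page
doc <- "C:/Users/ME/Desktop/trig_101302.html"
# Take the first table on the page and keep rows 6 to 58 (the state rows)
tab <- readHTMLTable(doc, stringsAsFactors = FALSE)[[1]][6:58, ]
tab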

This gives the table shown further down.

With the second URL you mention, the number of rows at the top changes and one of the columns seems to disappear (and something happens to Puerto Rico), so something more general

library(XML)
doc <- "C:/Users/ME/Desktop/trig_123117.html"
tab <- readHTMLTable(doc, stringsAsFactors = FALSE)[[1]]
# Keep the rows from Alabama to Wyoming, matched on the fourth column
tab <- tab[which(tab[[4]] == "Alabama"):which(tab[[4]] == "Wyoming"), ]

might work instead; the remaining risk is that, on future pages, the state names move out of the fourth column
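One way to reduce that risk is to search every column for the state names instead of hard-coding the fourth one. The following is only a sketch under assumptions: `state_rows` is a hypothetical helper (not part of the XML package), and the small `tab` here is a toy stand-in for what readHTMLTable returns.

```r
# Toy stand-in for the parsed page; the real `tab` would come from readHTMLTable().
tab <- data.frame(
  V1 = c("", "", "", "", ""),
  V2 = c("STATE", "Alabama", "Alaska", "Wyoming", "Footnotes"),
  V3 = c("IUR", "2.17", "3.63", "1.19", ""),
  stringsAsFactors = FALSE
)

# Hypothetical helper: find the first column that contains both `first` and
# `last`, then keep the rows from `first` to `last` inclusive.
state_rows <- function(tab, first = "Alabama", last = "Wyoming") {
  has_both <- vapply(tab, function(col) {
    col <- trimws(as.character(col))
    first %in% col && last %in% col
  }, logical(1))
  if (!any(has_both)) stop("no column contains both ", first, " and ", last)
  col <- trimws(as.character(tab[[which(has_both)[1]]]))
  tab[match(first, col):match(last, col), , drop = FALSE]
}

states <- state_rows(tab)
states
```

This still assumes the first and last rows are always Alabama and Wyoming, so it only removes the dependence on the column position.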


   V1 V2 V3                      V4 V5   V6  V7   V8  V9 V10 V11          V12
6   Â  Â  &                 Alabama  Â 2.17 118  5.6 105 121   Â E 06-04-1983
7   Â  Â  Â                  Alaska  Â 3.63 109  6.9 109 106   Â E 06-01-2002
8   Â  Â  &                 Arizona  Â 1.97 134  5.9 128 151   Â E 10-23-1982
9   Â  Â  &                Arkansas  Â 2.87 117  5.1  98 115   Â E 03-26-1983
10  Â  Â  &              California  Â 3.38 137  6.4 120 128   Â E 07-09-1983
11  Â  Â  &                Colorado  Â 1.69 182  5.1 141 182   Â E 01-24-1981
12  Â  Â  Â             Connecticut  Â 3.08 150  3.8 111 180   Â E 01-24-1981
13  Â  *  &                Delaware  Â 2.02 133  4.0 121  97   Â E 07-17-1982
14  Â  Â  &         District of Col  Â 1.62 122  6.0  89 107   Â E 01-24-1981
15  Â  *  &                 Florida  Â 1.88 131  5.3 112 147   Â E 01-24-1981
16  Â  *  &                 Georgia  Â 1.81 139  4.7 117 127   Â E 01-24-1981
17  Â  Â  &                  Hawaii  Â 1.93 107  4.1  93 100   Â E 01-24-1981
18  Â  Â  &                   Idaho  Â 2.31 125  5.3 108 110   Â E 06-22-2002
19  Â  Â  &                Illinois  Â 2.80 141  6.4 118 148   Â E 06-25-1983
20  Â  Â  &                 Indiana  Â 1.79 135  5.1 115 154   Â E 04-30-1983
21  Â  *  &                    Iowa  Â 1.75 138  3.8 115 146   Â E 06-04-1983
22  Â  Â  Â                  Kansas  Â 2.04 160  4.5 104 118   Â E 11-06-1982
23  Â  *  &                Kentucky  Â 2.13 125  5.2  92 126   Â E 03-19-1983
24  Â  Â  &               Louisiana  Â 1.93 132  5.9 103 105   Â E 03-14-1987
25  Â  Â  &                   Maine  Â 1.53 119  4.1 100 120   Â E 06-25-1994
26  Â  Â  &                Maryland  Â 2.00 139  4.2 102 105   Â E 07-31-1982
27  Â  *  &           Massachusetts  Â 3.20 137  5.0 131 192   Â E 06-29-1991
28  Â  Â  &                Michigan  Â 2.79 134  6.5 122 180   Â E 06-15-1991
29  Â  Â  &               Minnesota  Â 1.81 145  4.2 113 127   Â E 06-18-1983
30  Â  Â  &             Mississippi  Â 2.47 118  6.4 120 110   Â E 07-16-1983
31  Â  Â  &                Missouri  Â 2.40 126  5.1 108 150   Â E 06-19-1982
32  Â  Â  &                 Montana  Â 1.67 105  4.4  97  89   Â E 06-04-1983
33  Â  Â  &                Nebraska  Â 1.32 149  3.5 112 116   Â E 01-24-1981
34  Â  Â  &                  Nevada  Â 2.58 123  5.3 103 135   Â E 03-19-1983
35  Â  *  Â New \r\n      Hampshire  Â 1.45 165  4.5 121 155   Â E 01-24-1981
36  Â  Â  &              New Jersey  Â 3.40 130  5.4 128 145   Â E 06-19-1982
37  Â  Â  &              New Mexico  Â 1.94 139  6.2 131 129   Â E 11-27-1982
38  Â  Â  &                New York  Â 2.75 132  6.0 125 133   Â E 01-24-1981
39  @  Â  Â          North Carolina  Â 2.58 138  6.6 117 183  13 B 06-02-2002
40  Â  *  &            North Dakota  Â 0.96 121  3.3 117 110   Â E 06-11-1983
41  Â  Â  &                    Ohio  Â 1.98 130  5.6 130 136   Â E 05-14-1983
42  Â  Â  &                Oklahoma  Â 1.70 154  4.3 110 138   Â E 01-24-1981
43  @  Â  Â                  Oregon  Â 3.68 127  7.2 112 150  13 B 01-06-2002
44  Â  Â  &            Pennsylvania  Â 3.44 130  5.4 112 128   Â E 08-06-1983
45  Â  Â  &             Puerto Rico  Â 5.76 109 12.7 111 125   Â E 01-01-2000
46  Â  Â  Â            Rhode Island  Â 2.95 107  4.4  91 107   Â E 07-08-1995
47  Â  Â  &          South Carolina  Â 2.56 121  5.4  96 142   Â E 03-19-1983
48  Â  *  &            South Dakota  Â 0.70 137  2.8  82 127   Â E 01-24-1981
49  Â  Â  &               Tennessee  Â 2.07 108  4.8 106 123   Â E 09-25-1982
50  Â  Â  &                   Texas  Â 2.15 154  6.0 122 142   Â E 01-24-1981
51  Â  *  &                    Utah  Â 1.77 158  4.9 113 153   Â E 06-25-1983
52  Â  Â  Â                 Vermont  Â 2.30 164  4.0 111 137   Â E 07-13-1991
53  Â  Â  &          Virgin Islands  Â 2.84 240  5.6 186 280   Â E 08-06-1983
54  Â  Â  &                Virginia  Â 1.48 175  4.0 114 181   Â E 01-24-1981
55  @  *  Â              Washington  Â 3.61 133  7.0 109 134  13 B 01-06-2002
56  Â  Â  & West \r\n      Virginia  Â 2.31 126  6.2 126 112   Â E 07-13-1991
57  Â  Â  &               Wisconsin  Â 2.66 133  5.0 108 135   Â E 06-18-1983
58  Â  *  &                 Wyoming  Â 1.19 143  3.9  97 100   Â E 06-13-1987
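As a starting point for the cleaning, the stray "Â" cells and the line breaks inside names such as "New \r\n      Hampshire" can be removed with a couple of gsub calls. This is a rough sketch: it assumes the "Â" characters are mis-decoded non-breaking spaces (which I have not verified against the page), `clean_cell` is a hypothetical helper, and `raw` is toy data imitating two rows of the output above.

```r
# Toy values imitating the printed output; "\u00c2" is the stray "Â", and the
# first state name contains an embedded line break, as in row 35.
raw <- data.frame(
  V3 = c("\u00c2", "&"),
  V4 = c("New \r\n      Hampshire", "New Jersey"),
  V6 = c("1.45", "3.40"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: drop the stray characters and collapse whitespace runs.
clean_cell <- function(x) {
  x <- gsub("[\u00c2\u00a0]", "", x)   # assumed mis-decoded non-breaking spaces
  x <- gsub("[[:space:]]+", " ", x)    # line breaks and padding inside a cell
  trimws(x)
}

clean <- raw
clean[] <- lapply(raw, clean_cell)
# Let R pick a type per column: numeric strings become numbers
clean[] <- lapply(clean, type.convert, as.is = TRUE)
str(clean)
```

Applied to the real `tab`, the same two lapply calls would run over every column at once.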

Upvotes: 0

alistaire

Reputation: 43344

That table is pretty appallingly structured inside, so you're going to need to do some work to extract it into a proper data frame. Here is one way, using rvest for scraping and base R for cleanup (there are lots of helpful alternative packages, if you like):

library(rvest)

# scrape HTML
h <- read_html('https://ows.doleta.gov/unemploy/trigger/2002/trig_101302.html')

df <- h %>% 
  html_node('table') %>%    # select <table> HTML node
  html_table(fill = TRUE) %>%    # extract table from HTML to data frame
  head(57)    # omit end matter

# fix names to the point of R legality
names(df) <- make.names(gsub('\\s+', '.', 
                             sapply(head(df, 4), 
                                    paste, collapse = '.')), 
                        unique = TRUE)

df <- df[-1:-4, ]    # remove rows with names
df[] <- lapply(df, type.convert, as.is = TRUE)    # coerce to appropriate types

str(df)
#> 'data.frame':    53 obs. of  12 variables:
#>  $ ...                                     : chr  "" "" "" "" ...
#>  $ ....1                                   : chr  "" "" "" "" ...
#>  $ ....2                                   : chr  "&" "" "&" "&" ...
#>  $ ....3                                   : chr  "Alabama" "Alaska" "Arizona" "Arkansas" ...
#>  $ INDICATORS...                           : logi  NA NA NA NA NA NA ...
#>  $ INDICATORS.13WeekIUR..                  : num  2.17 3.63 1.97 2.87 3.38 1.69 3.08 2.02 1.62 1.88 ...
#>  $ INDICATORS.Pct.ofPrior2.Yrs..           : int  118 109 134 117 137 182 150 133 122 131 ...
#>  $ INDICATORS.3.moSATUR..                  : num  5.6 6.9 5.9 5.1 6.4 5.1 3.8 4 6 5.3 ...
#>  $ INDICATORS.Pct.of.prior.Year.           : int  105 109 128 98 120 141 111 121 89 112 ...
#>  $ INDICATORS.Pct.of.prior.2ndYear.        : int  121 106 151 115 128 182 180 97 107 147 ...
#>  $ INDICATORS.Pct.of.prior.Avail.WKS.      : int  NA NA NA NA NA NA NA NA NA NA ...
#>  $ STATUS.Periods.Begin.Date.B.End.Date.E..: chr  "E 06-04-1983" "E 06-01-2002" "E 10-23-1982" "E 03-26-1983" ...

It's still not great, but it can be cleaned up more easily from this point.
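For example, the status column could be split into a begin/end flag and a real Date. This is a minimal base R sketch: `status` holds a few values copied from the tables on this page, the names `period` and `date` are my own, and the B/E meaning is taken from the column header ("Begin Date B", "End Date E").

```r
# A few values in the same shape as the STATUS column
status <- c("E 06-04-1983", "B 06-02-2002", "E 01-24-1981")

parts <- data.frame(
  # First character is the flag: B = begin date, E = end date
  period = factor(substr(status, 1, 1),
                  levels = c("B", "E"), labels = c("begin", "end")),
  # The rest is a month-day-year date
  date = as.Date(substring(status, 3), format = "%m-%d-%Y")
)
parts
```

The result can be attached to the main data frame with cbind(df, parts) once it has been built from the real column.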

Upvotes: 1
