Reputation: 728
<TABLE cellspacing=1 cellpadding=7 rules=all frame=Box border=1>
<thead>
<TR>
<TD ROWSPAN=2 ALIGN=CENTER VALIGN=CENTER> </TD>
<TD COLSPAN=6 ALIGN=CENTER>1a. My peers make a positive impact my work environment.</TD>
<TD ALIGN=CENTER>Number</TD>
</TR>
<TR>
<TD ALIGN=CENTER>Strongly agree <br> </TD>
<TD ALIGN=CENTER>Generally agree <br> </TD>
<TD ALIGN=CENTER>Neither agree nor<br>disagree</TD>
<TD ALIGN=CENTER>Generally disagree<br> </TD>
<TD ALIGN=CENTER>Strongly disagree<br> </TD>
<TD ALIGN=CENTER>No basis to judge<br> </TD>
<TD ALIGN=CENTER>of Cases</TD>
</TR>
</thead>
<tbody>
<TR>
<TD ALIGN=LEFT VALIGN=TOP> Company-Wide </TD>
<TD ALIGN=RIGHT VALIGN=BOTTOM> 44.1</TD>
<TD ALIGN=RIGHT VALIGN=BOTTOM> 44.9</TD>
<TD ALIGN=RIGHT VALIGN=BOTTOM> 6.6</TD>
<TD ALIGN=RIGHT VALIGN=BOTTOM> 2.6</TD>
<TD ALIGN=RIGHT VALIGN=BOTTOM> 1.6</TD>
<TD ALIGN=RIGHT VALIGN=BOTTOM> 0.1</TD>
<TD ALIGN=RIGHT VALIGN=BOTTOM> 2,014</TD>
</TR>
<TR>
<TD ALIGN=LEFT VALIGN=TOP> Region 1 </TD>
<TD ALIGN=RIGHT VALIGN=BOTTOM> 45.6</TD>
<TD ALIGN=RIGHT VALIGN=BOTTOM> 45.2</TD>
<TD ALIGN=RIGHT VALIGN=BOTTOM> 5.7</TD>
<TD ALIGN=RIGHT VALIGN=BOTTOM> 2.1</TD>
<TD ALIGN=RIGHT VALIGN=BOTTOM> 1.4</TD>
<TD ALIGN=RIGHT VALIGN=BOTTOM> 0.1</TD>
<TD ALIGN=RIGHT VALIGN=BOTTOM> 1,699</TD>
</TR>
<TR>
<TD ALIGN=LEFT VALIGN=TOP>Division 1 </TD>
<TD ALIGN=RIGHT VALIGN=BOTTOM> 52.9</TD>
<TD ALIGN=RIGHT VALIGN=BOTTOM> 39.7</TD>
<TD ALIGN=RIGHT VALIGN=BOTTOM> 4.1</TD>
<TD ALIGN=RIGHT VALIGN=BOTTOM> 2.5</TD>
<TD ALIGN=RIGHT VALIGN=BOTTOM> 0.8</TD>
<TD ALIGN=RIGHT VALIGN=BOTTOM>0</TD>
<TD ALIGN=RIGHT VALIGN=BOTTOM> 121</TD>
</TR>
</tbody>
</TABLE>
<hr><A NAME="IDX1"> </A>
I have an HTML file that contains several tables like the one above. I would like to convert them into a single data frame in which each survey question, currently in the table header, appears in a column. The percentage responding would stay in one column and the response levels in another. Not all questions have the same number of response levels (e.g., some use a five-point scale, others a nine-point scale). I tried readHTMLTable followed by do.call(rbind, ...) on the result, but this fails because the tables do not all have the same number of columns. I welcome any advice on how to proceed. Thanks!
edit:
library(XML)
library(dplyr)
questions <- readHTMLTable(files[8], trim=TRUE, as.data.frame=TRUE, header=TRUE)
data<-bind_rows(questions)
This results in the data frame I want, but because some questions have more response levels than others, the "number of cases" values do not consistently end up in the same column. Is there a way to name the last column of each table before merging?
Upvotes: 10
Views: 15174
Reputation: 23788
You can use the rvest package for this. However, it might be necessary to pay attention to column names containing white space. I used the option fill=TRUE as a quick fix, but maybe this can be done in a better way.
library(rvest)
my_df <- as.data.frame(read_html(text) %>% html_table(fill=TRUE))
> my_df
# X1 X2 X3 X4 X5 X6 X7 X8
#1 1a. My peers make a positive impact my work environment. <NA> <NA> <NA> <NA> <NA> Number
#2 Strongly agree Generally agree Neither agree nordisagree Generally disagree Strongly disagree No basis to judge of Cases <NA>
#3 Company-Wide 44.1 44.9 6.6 2.6 1.6 0.1 2,014
#4 Region 1 45.6 45.2 5.7 2.1 1.4 0.1 1,699
#5 Division 1 52.9 39.7 4.1 2.5 0.8 0 121
Concerning the data, I copy-pasted the HTML code from the OP and assigned it to the variable text with text <- '<TABLE cellspacing=1 cellpadding=7 rules=all frame=...', using single quotation marks.
Some details of the format can be corrected afterwards in a rather simple way:
my_df[2,] <- c("",my_df[2,][-length(my_df)])
#> my_df
# X1 X2 X3 X4 X5 X6 X7 X8
#1 1a. My peers make a positive impact my work environment. <NA> <NA> <NA> <NA> <NA> Number
#2 Strongly agree Generally agree Neither agree nordisagree Generally disagree Strongly disagree No basis to judge of Cases
#3 Company-Wide 44.1 44.9 6.6 2.6 1.6 0.1 2,014
#4 Region 1 45.6 45.2 5.7 2.1 1.4 0.1 1,699
#5 Division 1 52.9 39.7 4.1 2.5 0.8 0 121
Essentially, in this case the entries of the second row should be shifted to the right by one cell.
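If the end goal is one data frame across tables with different numbers of response levels, it may be easier to reshape each table to long format before binding, so that the case count always lands in the same column. The following is only a rough sketch, not part of the approach above: the helper name tidy_survey_table and the output column names are made up for illustration, the sample HTML is a trimmed-down copy of the table from the question, and it assumes every table has the same layout (two header rows in thead, the group label in the first body cell, and the number of cases in the last one).

```r
library(xml2)

# Hypothetical helper: reshape one survey <table> node into long format,
# one row per group and response level.
# Assumes two header rows and the case count in the last body cell.
tidy_survey_table <- function(tbl) {
  # Text of every <td> in a row, with runs of whitespace collapsed
  cell_text <- function(row) {
    trimws(gsub("[[:space:]]+", " ", xml_text(xml_find_all(row, "./td"))))
  }
  head_rows <- xml_find_all(tbl, "./thead/tr")
  body_rows <- xml_find_all(tbl, "./tbody/tr")

  # First header row: empty corner cell, question text, "Number"
  question <- cell_text(head_rows[[1]])[2]
  # Second header row: response levels followed by "of Cases"
  resp_levels <- head(cell_text(head_rows[[2]]), -1)

  rows <- lapply(body_rows, function(row) {
    cells <- cell_text(row)
    k <- length(cells)
    data.frame(
      question = question,
      group    = cells[1],
      response = resp_levels,
      percent  = as.numeric(cells[2:(k - 1)]),
      n_cases  = as.numeric(gsub(",", "", cells[k])),
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, rows)
}

# Toy input: the question's table cut down to two response levels and one row
sample_html <- '<table>
<thead>
<tr><td rowspan=2> </td><td colspan=2>1a. My peers make a positive impact my work environment.</td><td>Number</td></tr>
<tr><td>Strongly agree <br> </td><td>Generally agree <br> </td><td>of Cases</td></tr>
</thead>
<tbody>
<tr><td> Company-Wide </td><td> 44.1</td><td> 44.9</td><td> 2,014</td></tr>
</tbody>
</table>'

tables <- xml_find_all(read_html(sample_html), "//table")
long <- do.call(rbind, lapply(tables, tidy_survey_table))
```

For the real file, read_html(files[8]) would take the place of read_html(sample_html). Because every table is reduced to the same five columns, rbind no longer depends on how many response levels a question has. Note that a <br> inside a cell contributes no text, so a label such as "Neither agree nor<br>disagree" still comes out as "Neither agree nordisagree", as in the output above.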
data
text <- '<TABLE cellspacing=1 cellpadding=7 rules=all frame=Box border=1>\n <thead>\n <TR>\n <TD ROWSPAN=2 ALIGN=CENTER VALIGN=CENTER> </TD>\n <TD COLSPAN=6 ALIGN=CENTER>1a. My peers make a positive impact my work environment.</TD>\n <TD ALIGN=CENTER>Number</TD>\n </TR>\n <TR>\n <TD ALIGN=CENTER>Strongly agree <br> </TD>\n <TD ALIGN=CENTER>Generally agree <br> </TD>\n <TD ALIGN=CENTER>Neither agree nor<br>disagree</TD>\n <TD ALIGN=CENTER>Generally disagree<br> </TD>\n <TD ALIGN=CENTER>Strongly disagree<br> </TD>\n <TD ALIGN=CENTER>No basis to judge<br> </TD>\n <TD ALIGN=CENTER>of Cases</TD>\n </TR>\n </thead>\n <tbody>\n <TR>\n <TD ALIGN=LEFT VALIGN=TOP> Company-Wide </TD>\n <TD ALIGN=RIGHT VALIGN=BOTTOM> 44.1</TD>\n <TD ALIGN=RIGHT VALIGN=BOTTOM> 44.9</TD>\n <TD ALIGN=RIGHT VALIGN=BOTTOM> 6.6</TD>\n <TD ALIGN=RIGHT VALIGN=BOTTOM> 2.6</TD>\n <TD ALIGN=RIGHT VALIGN=BOTTOM> 1.6</TD>\n <TD ALIGN=RIGHT VALIGN=BOTTOM> 0.1</TD>\n <TD ALIGN=RIGHT VALIGN=BOTTOM> 2,014</TD>\n </TR>\n <TR>\n <TD ALIGN=LEFT VALIGN=TOP> Region 1 </TD>\n <TD ALIGN=RIGHT VALIGN=BOTTOM> 45.6</TD>\n <TD ALIGN=RIGHT VALIGN=BOTTOM> 45.2</TD>\n <TD ALIGN=RIGHT VALIGN=BOTTOM> 5.7</TD>\n <TD ALIGN=RIGHT VALIGN=BOTTOM> 2.1</TD>\n <TD ALIGN=RIGHT VALIGN=BOTTOM> 1.4</TD>\n <TD ALIGN=RIGHT VALIGN=BOTTOM> 0.1</TD>\n <TD ALIGN=RIGHT VALIGN=BOTTOM> 1,699</TD>\n </TR>\n <TR>\n <TD ALIGN=LEFT VALIGN=TOP>Division 1 </TD>\n <TD ALIGN=RIGHT VALIGN=BOTTOM> 52.9</TD>\n <TD ALIGN=RIGHT VALIGN=BOTTOM> 39.7</TD>\n <TD ALIGN=RIGHT VALIGN=BOTTOM> 4.1</TD>\n <TD ALIGN=RIGHT VALIGN=BOTTOM> 2.5</TD>\n <TD ALIGN=RIGHT VALIGN=BOTTOM> 0.8</TD>\n <TD ALIGN=RIGHT VALIGN=BOTTOM>0</TD>\n <TD ALIGN=RIGHT VALIGN=BOTTOM> 121</TD>\n </TR>\n </tbody>\n </TABLE>\n <hr><A NAME=\"IDX1\"> </A>'
#> class(text)
#[1] "character"
Upvotes: 16