user1471980
user1471980

Reputation: 10626

How do you read an html table in R

I'm trying to read and html page that has a table like this:

url<-c("example.com/table")

content of the page is like this:

<html> <head> <meta http-equiv="Refresh" content="60, cecutil2.cgi?V1P1"/> <style type="text/css"> th,td {font-size:12px; height:14px} table {border-collapse:collapse;} </style> </head> <body> <table width="100%" cols="6" border="2px">
<tr><td style="width:50px; text-align:center"><a href="zview.html" target="_blank" title="v1p1.mf.example.com:1024 (10.175.128.62)">V1P1</a></td><th width="55px">15/06/22</th><th width="40px">12:21</th>
<th width="65px">3B0F97-0</th><th style="width:50px">19.11%</th> <td align="left"><img height="100%" src="grnline.gif" width="19.10954%"></td></tr>
<tr><td colspan="3"></td>
<th width="65px">3B0F97-1</th><th style="width:50px">20.69%</th> <td align="left"><img height="100%" src="grnline.gif" width="20.68669%"></td></tr>
<tr><td colspan="3"></td>
<th width="65px">3B0F97-2</th><th style="width:50px">22.14%</th> <td align="left"><img height="100%" src="grnline.gif" width="22.13768%"></td></tr>
<tr><td colspan="3"></td>
<th width="65px">3B0F97-3</th><th style="width:50px">35.25%</th> <td align="left"><img height="100%" src="grnline.gif" width="35.24557%"></td></tr>
<tr><td colspan="3"></td>
<th width="65px">3B0F97-4</th><th style="width:50px">38.04%</th> <td align="left"><img height="100%" src="grnline.gif" width="38.03592%"></td></tr>
<tr><td colspan="3"></td>
<th width="65px">3B0F97-5</th><th style="width:50px">23.73%</th> <td align="left"><img height="100%" src="grnline.gif" width="23.73468%"></td></tr>
<tr><td colspan="3"></td>
<th width="65px"> Total </th><th style="width:50px">659.55%</th> <td align="left"><img height="100%" src="grnline.gif" width="32.977437%"></td></tr>
<tr><th colspan="6" align="center">Linux Nodes (z/VM-Guests)</th></tr> <tr><td/><td colspan="5"><table cols="3" width="100%" border="1px">
<tr><th width="50px"> <a href="http://jassadmin.nj.example.com/adminsvcs/jassgleprocess.jsp?search_criteria=jas1a419" target="_blank">jas1a419</a></th><th style="width:50px">121.58%</th> <td align="left"><img height="100%" src="redline.png" width="100%"></td></tr>
<tr><th width="50px"> <a href="http://jassadmin.nj.example.com/adminsvcs/jassgleprocess.jsp?search_criteria=jas1a443" target="_blank">jas1a443</a></th><th style="width:50px">45.07%</th> <td align="left"><img height="100%" src="grnline.gif" width="45.07199%"></td></tr>
<tr><th width="50px"> <a href="http://jassadmin.nj.example.com/adminsvcs/jassgleprocess.jsp?search_criteria=jas1a185" target="_blank">jas1a185</a></th><th style="width:50px">36.53%</th> <td align="left"><img height="100%" src="grnline.gif" width="36.52853%"></td></tr>
<tr><th width="50px"> <a href="http://jassadmin.nj.example.com/adminsvcs/jassgleprocess.jsp?search_criteria=jas1a435" target="_blank">jas1a435</a></th><th style="width:50px">23.19%</th> <td align="left"><img height="100%" src="grnline.gif" width="23.18803%"></td></tr>
</td></tr></table>
</table></body></html>

I need to be able to read this data into two different data frames.

1st data frame should have data until a text shows as Linux Nodes, like this:

15/06/22    12:27       3B0F97-0    14.97%  
15/06/22    12:27       3B0F97-1    16.10%  
15/06/22    12:27       3B0F97-2    18.30%  
15/06/22    12:27       3B0F97-3    39.23%

and

2nd data frame should have the data in there from Linux Nodes below with the same time frame as above data frame.

15/06/22    12:27   jas1a419 121.58%  
15/06/22    12:27   jas1a443 45.07%  
15/06/22    12:27   jas1a185 36.53%  
15/06/22    12:27   jas1a435 23.19% 

When I do:

suppressMessages(library(RCurl))
suppressMessages(library(XML))
data<-readHTMLTable(url)

I dont get the time in the row? Any ideas, how I could modify this so that I could get two different data frames with times in them?

Upvotes: 0

Views: 1055

Answers (2)

RHertel
RHertel

Reputation: 23788

Your HTML file apparently does not contain explicitly the data that you're trying to retrieve. I'm not an expert in this, but maybe the information on the dates and times is provided by JavaServer Pages called from within the HTML code.

In any case, this is how your table looks in my browser:

This is how the HTML table looks like without the the jsp data

OK, if this is your original HTML file, then I suggest the following:

library(rvest)
q <- html(url) %>% html_table(fill=T) 
t2 <- as.data.frame(q[[2]][-3])
t1 <- as.data.frame(q[[1]])
ln_row <- as.numeric(row.names(t1[grep("Linux Nodes",t1[,1]),]))
t1 <- t1[-c((ln_row-1):nrow(t1)),-c(6:ncol(t1))]
t1[,2] <- t1[1,2] # fill the second column of table 1 with the date
t1[,3] <- t1[1,3] # fill the third column of table 1 with the time
t2[,3] <- t1[1,2] # copy date from table 1 into table 2
t2[,4] <- t1[1,3] # copy time from table 1 into table 2
t2 <- t2[c(3,4,1,2)] #reorder columns
#> t1
#    X1       X2    X3       X4      X5
#1 V1P1 15/06/22 12:21 3B0F97-0  19.11%
#2      15/06/22 12:21 3B0F97-1  20.69%
#3      15/06/22 12:21 3B0F97-2  22.14%
#4      15/06/22 12:21 3B0F97-3  35.25%
#5      15/06/22 12:21 3B0F97-4  38.04%
#6      15/06/22 12:21 3B0F97-5  23.73%

#> t2
#        V3    V4       X1      X2
#1 15/06/22 12:21 jas1a419 121.58%
#2 15/06/22 12:21 jas1a443  45.07%
#3 15/06/22 12:21 jas1a185  36.53%
#4 15/06/22 12:21 jas1a435  23.19%

Hope this helps.

Upvotes: 1

Rodrigo
Rodrigo

Reputation: 5129

Try XML package, function readHTMLTable(). If your (strange) table may be read into a data.frame, then it'll be easier to select what you want.

Upvotes: 1

Related Questions