user4685747

Reputation: 21

Extracting information from XML using R

I am trying to extract the date from the following HTML using R and xpathSApply:

    </td>
</tr>
<tr
    data-row-id="1363827503"
    class="future "
    data-lat-from="-33.946098"
    data-lon-from="151.1772"
    data-lat-to="33.94252"
    data-lon-to="-118.406998"
    data-name-from="Sydney Kingsford Smith Airport"
    data-name-to="Los Angeles International Airport"
    data-date="2015-03-23"
    data-flight=""
    data-flight-number="VA1"
>

Here is the code in R I have tried:

library(XML)
url <- "http://www.flightradar24.com/data/flights/va1/"
info <- htmlTreeParse(url, useInternalNodes = TRUE)
xpathSApply(info, "//data-date", xmlValue)

This returns: list()

I would like it to return: 2015-03-23

Upvotes: 2

Views: 153

Answers (1)

Mathias Müller

Reputation: 22617

This is the part of the document you are interested in:

<tr
    data-row-id="1363827503"
    class="future "
    data-lat-from="-33.946098"
    data-lon-from="151.1772"
    data-lat-to="33.94252"
    data-lon-to="-118.406998"
    data-name-from="Sydney Kingsford Smith Airport"
    data-name-to="Los Angeles International Airport"
    data-date="2015-03-23"
    data-flight=""
    data-flight-number="VA1"
>

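As you can see, data-date is not an element; it is an attribute of a tr element. The expression //data-date looks for elements named data-date, finds none, and therefore returns an empty list. Use //tr/@data-date as the XPath expression to retrieve the data-date attribute.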
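
To make the element/attribute difference concrete, here is a minimal sketch in R. It parses a small inline fragment modelled on the row from the question instead of fetching the live page; the fragment itself and the variable names are assumptions made for illustration.

```r
library(XML)

# Hypothetical inline fragment modelled on the row in the question,
# so the sketch runs without fetching the live page.
html <- '<table><tr data-row-id="1363827503" data-date="2015-03-23"><td>VA1</td></tr></table>'
doc <- htmlParse(html, asText = TRUE)

# "//data-date" looks for <data-date> elements; none exist, so nothing matches.
as_element <- xpathSApply(doc, "//data-date", xmlValue)

# "@" selects the data-date attribute on the <tr> element instead.
# as.character() drops the attribute name that comes back with the value.
as_attribute <- as.character(xpathSApply(doc, "//tr/@data-date"))

print(length(as_element))
print(as_attribute)
```

Note that no xmlValue function is passed in the attribute case: as far as I know, the XML package already returns the attribute values as a character vector.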

But note that there are multiple data-date attributes in this document. To only retrieve a single attribute, you also need a way to identify a specific row, for instance with

//tr[@data-row-id="1363827503"]/@data-date

The ID 1363827503 occurs only once in this document and is therefore a unique identifier for this tr element.
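
As a rough sketch of the row filter in R: the fragment below contains the row from the question plus a second, made-up row (ID 1363827504 and date 2015-03-24 are invented for this example) so that the predicate has something to exclude.

```r
library(XML)

# Two-row fragment: the first row mirrors the question,
# the second row is hypothetical and exists only to show the filtering.
html <- '<table>
  <tr data-row-id="1363827503" data-date="2015-03-23"><td>VA1</td></tr>
  <tr data-row-id="1363827504" data-date="2015-03-24"><td>VA1</td></tr>
</table>'
doc <- htmlParse(html, asText = TRUE)

# Without a predicate, every tr contributes its data-date attribute.
all_dates <- as.character(xpathSApply(doc, "//tr/@data-date"))

# The predicate [@data-row-id='...'] keeps only the matching row.
one_date <- as.character(
  xpathSApply(doc, "//tr[@data-row-id='1363827503']/@data-date")
)

print(all_dates)
print(one_date)
```

Single quotes are used inside the XPath predicate so that the whole expression can sit in a double-quoted R string.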

Upvotes: 2
