ByulTaeng
ByulTaeng

Reputation: 1269

Extract HTML content using YQL?

Let say I want to extract data from a web page with the following markup:

<table>
  <tr>
    <td><a href="Link 1">Column 1 Text</a></td>
    <td>Column 2 Text</td>
    <td>Column 3 Text</td>
  </tr>
  <tr>
    <td><a href="Link 2">Column 1 Text</a></td>
    <td>Column 2 Text</td>
    <td>Column 3 Text</td>
  </tr>
  ...
</table>

to JSON format :

[
  {
    link: 'Link 1',
    text: 'Column 1 Text',
    data: 'Column 3 Text'
  },
  {
    link: 'Link 2',
    text: 'Column 1 Text',
    data: 'Column 3 Text'
  }
]

Can we make it with YQL? If yes then please give me an example query.

Any helps would be appreciated!

Upvotes: 1

Views: 1111

Answers (1)

BrianC
BrianC

Reputation: 10721

Here's a query that's a good starting point, using the HTML table along with some XPath query (see Extracting HTML Content With XPath for more details on this technique):

select * from html where url="http://cantoni.org/test/table.html" and xpath='//table/tr'

Which produces JSON results like this:

{
 "query": {
  "count": 2,
  "created": "2012-01-06T20:16:46Z",
  "lang": "en-US",
  "results": {
   "tr": [
    {
     "td": [
      {
       "a": {
        "href": "Link%201",
        "content": "Column 1 Text"
       }
      },
      {
       "p": "Column 2 Text"
      },
      {
       "p": "Column 3 Text"
      }
     ]
    },
    {
     "td": [
      {
       "a": {
        "href": "Link%202",
        "content": "Column 1 Text"
       }
      },
      {
       "p": "Column 2 Text"
      },
      {
       "p": "Column 3 Text"
      }
     ]
    }
   ]
  }
 }
}

Upvotes: 1

Related Questions