Amit

Reputation:

Extracting meta tags attribute using wget

I have a file with one URL per line. I need to extract the "keywords" from each page's meta tags, i.e. if a page has a meta tag named "keywords", I want its "content" value. For example, if the web page has this meta tag:

<meta name="keywords" content="wikipedia,encyclopedia">

then for that URL I want "wikipedia,encyclopedia" to be extracted.

One approach is to download the web page using "wget" and then parse it with a standard HTML parser.
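For illustration, here is a minimal sketch of the parsing half of that approach, assuming Python's standard-library `html.parser` and a page already saved to disk (for example with `wget -q -O page.html <url>`). The names `KeywordsParser`, `extract_keywords`, and `page.html` are made up for this example and are not part of any tool mentioned here.

```python
from html.parser import HTMLParser


class KeywordsParser(HTMLParser):
    """Collects the content attribute of every <meta name="keywords"> tag."""

    def __init__(self):
        super().__init__()
        self.keywords = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "meta":
            return
        attr_map = dict(attrs)
        # Attribute values can be None (e.g. <meta name>), so guard before lower().
        name = (attr_map.get("name") or "").lower()
        if name == "keywords" and attr_map.get("content") is not None:
            self.keywords.append(attr_map["content"])


def extract_keywords(html_text):
    """Return a list of keyword strings found in the given HTML text."""
    parser = KeywordsParser()
    parser.feed(html_text)
    return parser.keywords
```

Usage would be something like `extract_keywords(open("page.html", encoding="utf-8", errors="replace").read())`, run once per downloaded file. It returns an empty list when the page has no keywords meta tag.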

I was wondering whether there is a better way to do this without downloading the entire web page.

Upvotes: 2

Views: 1827

Answers (3)

surfealokesea

Reputation: 5116

Here is another option, an HTML DOM parser for PHP:

http://simplehtmldom.sourceforge.net

I haven't tried it yet!

Upvotes: 0

Su'

Reputation: 2166

If you're comfortable with some PHP, you should be able to put something together pretty easily by wrapping a loop around QueryPath.

Swiping an example from the docs, this:

require 'QueryPath/QueryPath.php';

$url = 'http://example.com';
print qp($url, 'title')->text();

...will go out and get the document at example.com, extract the text of the title tag and output it.
It'd only take a little more work to make that look for meta keywords tags and extract the content attribute, especially if you're already familiar with jQuery. (It's a bit of a simplification, but a large chunk of QueryPath is more or less implementing a "server-side jQuery.")

If you pursue this programmatic method and have further questions, they should probably go on the main Stack Overflow site where there's also an active querypath tag.

Upvotes: 0

LazyOne

Reputation: 165188

No, you have to download the whole page, or interrupt the download after receiving some amount of data. The latter is even worse and much more complicated: AFAIK it cannot be done with wget, so you would have to code your own downloader.
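To give an idea of what "coding your own" might involve, here is a rough sketch in Python (standard library only). It reads a stream in small chunks and stops as soon as it has seen the keywords meta tag, the end of `<head>`, or a byte limit. The names `HeadKeywordsParser` and `keywords_from_stream`, the chunk size, the byte limit, and the UTF-8 assumption are all choices made for this example, not something wget or any other tool provides.

```python
import codecs
from html.parser import HTMLParser


class HeadKeywordsParser(HTMLParser):
    """Remembers the keywords meta tag and notices when <head> is over."""

    def __init__(self):
        super().__init__()
        self.keywords = None
        self.head_done = False

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attr_map = dict(attrs)
            if (attr_map.get("name") or "").lower() == "keywords":
                self.keywords = attr_map.get("content")
        elif tag == "body":
            # Some pages omit </head>; a <body> tag also means the head is over.
            self.head_done = True

    def handle_endtag(self, tag):
        if tag == "head":
            self.head_done = True


def keywords_from_stream(stream, chunk_size=2048, max_bytes=65536):
    """Read `stream` chunk by chunk; return (keywords or None, bytes read)."""
    parser = HeadKeywordsParser()
    # Incremental decoder so a multi-byte character split across chunks is safe.
    # Assumes the page is UTF-8; undecodable bytes are replaced.
    decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
    read_total = 0
    while read_total < max_bytes:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        read_total += len(chunk)
        parser.feed(decoder.decode(chunk))
        if parser.keywords is not None or parser.head_done:
            break  # stop early; the rest of the page is never read
    return parser.keywords, read_total
```

With a real URL, the object returned by `urllib.request.urlopen(url)` could be passed as `stream`, and closing it afterwards drops the connection. The server may well have sent more than was actually read by then, so how much bandwidth this saves depends on buffering and on how large the page's head section is.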

Upvotes: 0
