Amit

Reputation: 604

Extracting meta tags attribute using wget

I have a file with one URL per line. I need to extract the "keywords" from the meta tags of each page, i.e. if a page has a meta tag for "keywords", I want to get its "content" value. Example: if the web page has the meta tag <meta name="keywords" content="wikipedia,encyclopedia">, then for that URL I want "wikipedia,encyclopedia" to be extracted.

One approach is to download the web-page using "wget" and then parse it using some standard HTML parser.
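
A minimal sketch of the parsing step in that approach, assuming the page has already been downloaded and decoded into a string. It uses Python's standard html.parser module; the names KeywordsParser and extract_keywords are made up for illustration.

```python
from html.parser import HTMLParser


class KeywordsParser(HTMLParser):
    """Remembers the content of the first <meta name="keywords" ...> tag."""

    def __init__(self):
        super().__init__()
        self.keywords = None

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased; values may be None.
        if tag != "meta" or self.keywords is not None:
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "keywords":
            self.keywords = attr.get("content")


def extract_keywords(html):
    """Return the keywords string, or None if the page has no such tag."""
    parser = KeywordsParser()
    parser.feed(html)
    return parser.keywords
```

Calling extract_keywords on the saved page text returns the raw "content" string, which can then be split on commas if individual keywords are needed.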

Is there a better way to do this without downloading the entire web page?

Upvotes: 0

Views: 766

Answers (1)

Karoly Horvath

Reputation: 96258

What you described is the simplest solution to implement.

If you are worried about the network traffic this generates, you could write a small program that reads only the head section of the page. As soon as you read the <body..> tag, you can stop downloading.

Update: You have to set a very small receive buffer for your socket, otherwise the kernel will probably still download the whole page. Verify your solution with tcpdump.
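
A rough sketch of the early-stop idea in Python, using only the standard library. The names HeadKeywordsParser, keywords_from_chunks and fetch_keywords are hypothetical, and the code assumes the page is UTF-8 and served uncompressed. It reads the response in small chunks and stops once it has found the keywords tag, the closing head tag, or the opening body tag. It does not tune the socket receive buffer, so the kernel may already have received more data than the program reads.

```python
import codecs
import urllib.request
from html.parser import HTMLParser


class HeadKeywordsParser(HTMLParser):
    """Finds <meta name="keywords"> and flags when reading can stop."""

    def __init__(self):
        super().__init__()
        self.keywords = None
        self.done = False

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.done = True  # past the head section, nothing more to find
        elif tag == "meta" and not self.done:
            attr = dict(attrs)
            if (attr.get("name") or "").lower() == "keywords":
                self.keywords = attr.get("content")
                self.done = True

    def handle_endtag(self, tag):
        if tag == "head":
            self.done = True


def keywords_from_chunks(chunks):
    """Feed byte chunks to the parser, stopping as soon as it is done."""
    parser = HeadKeywordsParser()
    # Assumption: UTF-8. The incremental decoder copes with a multi-byte
    # character that is split across two chunks.
    decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
    for chunk in chunks:
        parser.feed(decoder.decode(chunk))
        if parser.done:
            break
    return parser.keywords


def fetch_keywords(url, chunk_size=512):
    """Read the response chunk_size bytes at a time; close it early."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        # iter(callable, sentinel) keeps calling read() until it returns b"".
        return keywords_from_chunks(iter(lambda: resp.read(chunk_size), b""))
```

For each URL in the file, fetch_keywords(url) returns the "content" string or None. Leaving the with block closes the connection, which is what stops the transfer, but how many bytes actually crossed the wire still depends on buffering, so checking with tcpdump remains worthwhile.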

Upvotes: 0
