Reputation: 604
I have a file containing a list of URLs. I need to extract the "keywords" from each page's meta tags, i.e. if the page has a meta tag for "keywords", I want its "content" value. Example: if the web page has the meta tag <meta name="keywords" content="wikipedia,encyclopedia">, then for that URL I want "wikipedia,encyclopedia" to be extracted.
One approach is to download the web page using "wget" and then parse it with a standard HTML parser.
Is there a better way to do this without downloading the entire web page?
Upvotes: 0
Views: 766
Reputation: 96258
What you described is the simplest solution to implement.
If you are worried about the network traffic this generates, you could write a small program that reads only the HTML head section. As soon as you read the <body ...> tag, you can stop downloading.
Update: You have to set a very small receive buffer for your socket, otherwise the kernel will probably still download the whole page. Verify your solution with tcpdump.
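A minimal sketch of the "stop at the body tag" idea, assuming Python's standard library (html.parser and urllib) and UTF-8 compatible pages. The names KeywordsParser, keywords_from_chunks and fetch_keywords are made up for illustration. It feeds the parser small chunks and gives up as soon as it sees the keywords meta tag, the end of the head, or the start of the body. It does not shrink the socket receive buffer, so the kernel may still have fetched more bytes than the program actually read.

```python
import codecs
from html.parser import HTMLParser
from urllib.request import urlopen


class _StopParsing(Exception):
    """Raised to abandon parsing once nothing more is needed."""


class KeywordsParser(HTMLParser):
    """Hypothetical helper: remembers the content of <meta name="keywords">."""

    def __init__(self):
        super().__init__()
        self.keywords = None

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attrs = dict(attrs)
            if (attrs.get("name") or "").lower() == "keywords":
                self.keywords = attrs.get("content")
                raise _StopParsing
        elif tag == "body":
            # The head is over; no point in reading further.
            raise _StopParsing

    def handle_endtag(self, tag):
        if tag == "head":
            raise _StopParsing


def keywords_from_chunks(chunks):
    """Feed byte chunks to the parser lazily; stop at the first decisive tag."""
    parser = KeywordsParser()
    # Incremental decoding copes with multi-byte characters split across chunks.
    decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
    try:
        for chunk in chunks:
            parser.feed(decoder.decode(chunk))
    except _StopParsing:
        pass
    return parser.keywords


def fetch_keywords(url, chunk_size=1024):
    """Read the page in small chunks; leaving the with-block closes the connection."""
    with urlopen(url, timeout=10) as resp:
        return keywords_from_chunks(iter(lambda: resp.read(chunk_size), b""))
```

For the file of URLs, you would call fetch_keywords(line.strip()) for each line. Checking with tcpdump, as suggested above, is still the way to see how much data really crossed the wire.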
Upvotes: 0