Reputation: 401
So far I am using curl along with w3m and sed to extract portions of a webpage, like <body>....content....</body>. I want to ignore all the other tags (e.g. <a></a>, <div></div>). But the way I am doing it right now is really slow.
curl -L "http://www.somewebpage.com" | sed -n -e '\:<article class=:,\:<div id="below">: p' > file.html
w3m -dump file.html > file2.txt
These two lines above are really slow because curl has to first save the whole webpage into a file so sed can parse it, and then w3m parses that file and saves the result into another file. I just want to simplify this code. I was wondering if there is a way with lynx or html2text to extract webpage content between specified tags. So, for example, if I wanted to extract something from a webpage (www.badexample.com, not the actual link) with this content:
<title>blah......blah...</title>
<body>
Some text I need to extract
</body>
more stuffs
Is there a program to which I can specify the tags to extract content from? So I would run something like someprogram <body></body> www.badexample.com and it would extract only the content between those tags?
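Something like the single pipeline below is roughly what I am after (an untested sketch using the same sed markers as above; it assumes w3m can take HTML on stdin via -T text/html, and extract_tag is only a hypothetical illustration of the someprogram interface, assuming the opening and closing tags sit on their own lines):
# Same extraction as above, but nothing is written to disk between the steps.
curl -sL "http://www.somewebpage.com" \
  | sed -n -e '\:<article class=:,\:<div id="below">: p' \
  | w3m -dump -T text/html > file2.txt

# Hypothetical wrapper for the "someprogram <tag> URL" idea, using only curl and sed.
extract_tag() {   # usage: extract_tag body http://www.badexample.com
  local tag=$1 url=$2
  curl -sL "$url" | sed -n "/<$tag[ >]/,/<\/$tag>/p"
}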
Upvotes: 1
Views: 756
Reputation: 26074
Must it be in bash? What about PHP and DOMDocument()?
$dom = new DOMDocument();
$new_dom = new DOMDocument();
$url_value = 'http://www.google.com';
// Fetch the page and parse it into a DOM tree.
$html = file_get_contents($url_value);
$dom->loadHTML($html);
// Take the <body> element and copy its children into the new document.
$body = $dom->getElementsByTagName('body')->item(0);
foreach ($body->childNodes as $child){
    $new_dom->appendChild($new_dom->importNode($child, true));
}
// Output only what was inside <body> as HTML.
echo $new_dom->saveHTML();
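If you save that snippet as, say, extract_body.php (a name chosen only for illustration) inside <?php ... ?> tags, you could run it from the shell and pipe the result through w3m to get plain text, much like the second step in the question. An untested sketch, assuming your w3m accepts HTML on stdin via -T text/html:
php extract_body.php | w3m -dump -T text/html > file2.txt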
Upvotes: 1
Reputation: 39395
You can use a Perl one-liner for this (the -e code is in single quotes so the shell does not expand $ARGV):
perl -MLWP::Simple -e 'print get($ARGV[0]) =~ /<$ARGV[1]>(.*?)<\/$ARGV[1]>/;' http://www.example.com/ title
Instead of the HTML tag, you can pass the whole regex as well (the /s modifier lets . match across newlines, since <body> content usually spans several lines):
perl -MLWP::Simple -e 'print get($ARGV[0]) =~ /$ARGV[1]/s;' "http://www.example.com/" "<body>(.*?)</body>"
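For example, to pull just the body of a page and render it as plain text in one go (an untested sketch; www.badexample.com stands in for the real URL, and it assumes w3m accepts HTML on stdin via -T text/html):
perl -MLWP::Simple -e 'print get($ARGV[0]) =~ /$ARGV[1]/s;' "http://www.badexample.com" "<body>(.*?)</body>" | w3m -dump -T text/html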
Upvotes: 1