Reputation: 401
So far I am using curl along with w3m and sed to extract portions of a webpage, like <body>....content....</body>. I want to ignore all the other tags (e.g. <a></a>, <div></div>). But the way I am doing it right now is really slow.
curl -L "http://www.somewebpage.com" | sed -n -e '\:<article class=:,\:<div id="below">: p' > file.html
w3m -dump file.html > file2.txt
These two lines above are really slow because curl has to first save the whole webpage into a file so sed can parse it, and then w3m parses that file and saves the result into another file. I just want to simplify this code. I was wondering if there is a way with lynx or html2text to extract webpage content between specified tags. So, for example, if I wanted to extract something from a webpage (www.badexample.com, not the actual link) with this content:
<title>blah......blah...</title>
<body>
Some text I need to extract
</body>
more stuffs
Is there a program to which I can specify the tags to extract content from? So I would run something like someprogram <body></body> www.badexample.com and it would extract only the content between those tags?
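Something like the single pipeline below is roughly what I am after (an untested sketch using the same sed markers as above; it assumes w3m can take HTML on stdin via -T text/html, and extract_tag is only a hypothetical illustration of the someprogram interface, assuming the opening and closing tags sit on their own lines):
# Same extraction as above, but nothing is written to disk between the steps.
curl -sL "http://www.somewebpage.com" \
  | sed -n -e '\:<article class=:,\:<div id="below">: p' \
  | w3m -dump -T text/html > file2.txt

# Hypothetical wrapper for the "someprogram <tag> URL" idea, using only curl and sed.
extract_tag() {   # usage: extract_tag body http://www.badexample.com
  local tag=$1 url=$2
  curl -sL "$url" | sed -n "/<$tag[ >]/,/<\/$tag>/p"
}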
Upvotes: 1
Views: 756
Reputation: 26074
Must it be in bash? What about PHP and DOMDocument()?
$dom = new DOMDocument();
$new_dom = new DOMDocument();
$url_value = 'http://www.google.com';
// Fetch the page and parse it into a DOM tree.
$html = file_get_contents($url_value);
$dom->loadHTML($html);
// Take the <body> element and copy its children into the new document.
$body = $dom->getElementsByTagName('body')->item(0);
foreach ($body->childNodes as $child){
    $new_dom->appendChild($new_dom->importNode($child, true));
}
// Output only what was inside <body> as HTML.
echo $new_dom->saveHTML();
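If you save that snippet as, say, extract_body.php (a name chosen only for illustration) inside <?php ... ?> tags, you could run it from the shell and pipe the result through w3m to get plain text, much like the second step in the question. An untested sketch, assuming your w3m accepts HTML on stdin via -T text/html:
php extract_body.php | w3m -dump -T text/html > file2.txt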
Upvotes: 1
Reputation: 39395
You can use a Perl one-liner for this (the -e code is in single quotes so the shell does not expand $ARGV):
perl -MLWP::Simple -e 'print get($ARGV[0]) =~ /<$ARGV[1]>(.*?)<\/$ARGV[1]>/;' http://www.example.com/ title
Instead of the HTML tag, you can pass the whole regex as well (the /s modifier lets . match across newlines, since <body> content usually spans several lines):
perl -MLWP::Simple -e 'print get($ARGV[0]) =~ /$ARGV[1]/s;' "http://www.example.com/" "<body>(.*?)</body>"
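For example, to pull just the body of a page and render it as plain text in one go (an untested sketch; www.badexample.com stands in for the real URL, and it assumes w3m accepts HTML on stdin via -T text/html):
perl -MLWP::Simple -e 'print get($ARGV[0]) =~ /$ARGV[1]/s;' "http://www.badexample.com" "<body>(.*?)</body>" | w3m -dump -T text/html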
Upvotes: 1