Reputation: 85
We can download the source of a page using wget or curl, but I want to extract the source of the page without the tags. In other words, I want to extract it as plain text.
Upvotes: 2
Views: 11483
Reputation: 15461
You can pipe the output to a simple sed command:
curl www.gnu.org | sed 's/<\/*[^>]*>//g'
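As a small hedged extension of the same one-liner (the blank-line filter is my own addition, not part of the original command), you can also delete the empty lines that are left behind once the tags are stripped:
curl -s www.gnu.org | sed -e 's/<\/*[^>]*>//g' -e '/^[[:space:]]*$/d'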
Upvotes: 8
Reputation: 41
Using curl, wget, and a local Apache Tika server, you can parse HTML into plain text directly from the command line.
First, you have to download the tika-server jar from the Apache site: https://tika.apache.org/download.html
Then, run it as a local server:
$ java -jar tika-server-1.12.jar
After that, you can start parsing text using the following URL:
http://localhost:9998/tika
Now, to parse the HTML of a webpage into plain text:
$ wget -O test.html YOUR-HTML-URL && curl -H "Accept: text/plain" -T test.html http://localhost:9998/tika
That should return the webpage text without tags.
This way you use wget to download and save the desired webpage to "test.html", and then curl to send a request to the Tika server, which extracts the text. Notice that it's necessary to send the header "Accept: text/plain", because Tika can return several formats, not just plain text.
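As a hedged variation (assuming the Tika server started above is still listening on its default port 9998), you can skip the temporary file and pipe the page straight from one curl into another, uploading from stdin with -T -:
$ curl -s YOUR-HTML-URL | curl -s -H "Accept: text/plain" -T - http://localhost:9998/tika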
Upvotes: 1
Reputation: 3859
Create a Ruby script that uses Nokogiri to parse the HTML:
require 'nokogiri'
require 'open-uri'

# Fetch the page and parse it (URI.open is provided by open-uri)
html = Nokogiri::HTML(URI.open('https://stackoverflow.com/questions/6129357'))

# Take the text content of <body>, with all tags stripped
text = html.at('body').inner_text
puts text
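A minimal usage sketch, assuming the script above is saved as extract_text.rb (a file name chosen here just for illustration) and that the Nokogiri gem is not yet installed:
$ gem install nokogiri
$ ruby extract_text.rb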
It would probably be simple to do with JavaScript or Python if you're more comfortable with those, or you could look for an html-to-text utility (see the sketch below). I imagine it would be very difficult to do this purely in bash.
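One such utility route, sketched on the assumption that lynx is installed: its -dump mode renders a page as plain text directly from the command line, and -nolist suppresses the trailing list of links.
$ lynx -dump -nolist https://stackoverflow.com/questions/6129357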
See also: bash command to convert html page to a text file
Upvotes: 0