Reputation: 1739
I'm just learning bash scripting, and I was trying to scrape some data out of a site, mostly Wiktionary. This is what I'm trying on the command line right now, but it is not returning any result:
wget -qO- http://en.wiktionary.org/wiki/robust | egrep '<ol>{[a-zA-Z]*[0-9]*}*</ol>'
What I'm trying to do is get the data between the tags; I just want it to be displayed. Can you please help me find out what I'm doing wrong?
Thanks
Upvotes: 5
Views: 13349
Reputation: 20045
If I understand the question correctly, the goal is to extract the visible text content from within the <ol> sections. I would do it this way:
wget -qO- http://en.wiktionary.org/wiki/robust |
hxnormalize -x |
hxselect "ol" |
lynx -stdin -dump -nolist
[source: "Using the Linux Shell for Web Scraping"]
hxnormalize preprocesses the HTML code for hxselect, which applies the CSS selector "ol". Lynx then renders the result and reduces it to what would be visible in a browser.
Upvotes: 1
Reputation: 420991
At the least you need grep's -e switch and wget's -O - option. Honestly, though, I'd say grep is the wrong tool for this task, since grep works on a per-line basis and your expression stretches over several lines. I think sed or awk would be a better fit.
With sed it would look like:
wget -O - -q http://en.wiktionary.org/wiki/robust | sed -n "/<ol>/,/<\/ol>/p"
If you want to get rid of the extra <ol> and </ol> lines, you can append:
... | grep -v -E "</?ol>"
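The sed range can be tried offline against a small inline snippet. This is a minimal sketch; the HTML below is a made-up stand-in, not the real Wiktionary markup:

```shell
# Hypothetical snippet standing in for the downloaded page.
html='<html><body>
<ol>
<li>strong and healthy</li>
<li>sturdy in construction</li>
</ol>
</body></html>'

# Print every line from <ol> through </ol>, inclusive:
printf '%s\n' "$html" | sed -n '/<ol>/,/<\/ol>/p'

# Same, but drop the <ol> and </ol> lines themselves:
printf '%s\n' "$html" | sed -n '/<ol>/,/<\/ol>/p' | grep -v -E '</?ol>'
```

Note that the /start/,/end/ range in sed selects whole lines, so this approach assumes <ol> and </ol> each sit on their own line, as the answer assumes they do on the Wiktionary page.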
Upvotes: 2
Reputation: 31182
You need to send wget's output to stdout:
wget -q http://en.wiktionary.org/wiki/robust -O - | ...
To get all the <ol> blocks with grep, you can do:
wget -q http://en.wiktionary.org/wiki/robust -O - | tr '\n' ' ' | grep -o '<ol>.*</ol>'
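This works because tr joins every line into one, so grep can match across what were line breaks. A self-contained sketch with made-up HTML; note that .* is greedy, so if the page contains several <ol> blocks, the match spans from the first <ol> to the last </ol>:

```shell
# Made-up snippet standing in for the downloaded page.
html='<ol>
<li>strong and healthy</li>
</ol>'

# Join all lines into one, then match the whole block:
printf '%s\n' "$html" | tr '\n' ' ' | grep -o '<ol>.*</ol>'

# With GNU grep, -P enables a non-greedy match that keeps
# multiple <ol> blocks separate:
#   ... | tr '\n' ' ' | grep -oP '<ol>.*?</ol>'
```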
Upvotes: 4