user1249747

Reputation: 3

grep html file from wget

I use wget to download the number of papers matching a given query on scholar.google.com, and I obtain a file containing the full content of the page.

I want to retrieve the last number in the following part of the file "Results 1 - 10 of about 8,890."

I tried:

 cat /dir/file | tr -d "," | grep -o -E -- 'about ([^"]+) \w+'

but it outputs:

 about <b>8890</b>.   (<b>0.12</b> sec)&nbsp;</font></td></tr></table></form>    <div class

whereas I just want the 8890 (with no comma, which is taken care of by tr -d ",").

Any suggestions on how to improve it? Thank you in advance!

Upvotes: 0

Views: 655

Answers (3)

mohit6up

Reputation: 4348

If the HTML tags (<b> and </b>) are present in your file, you'll have to modify your regex to account for them too. To get just the fragment you're interested in, use a lookbehind assertion. Here's something that should work:

 cat /dir/file | tr -d "," | grep -oP -- '(?<=about <b>)[^/<> ]+'
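To try the lookbehind approach without touching the real download, here is a minimal sketch. It assumes GNU grep built with PCRE support (the -P option is not available in every grep), and the sample line is made-up test data modeled on the output shown in the question. It also narrows the character class to digits, which is an assumption about what the count looks like once the commas are removed.

```shell
# Hypothetical sample line, modeled on the output quoted in the question.
sample='Results <b>1</b> - <b>10</b> of about <b>8,890</b>.   (<b>0.12</b> sec)'

# Strip commas, then print only the digits that directly follow "about <b>".
# -o prints just the matched text; -P enables Perl-style regexes (GNU grep).
count=$(printf '%s\n' "$sample" | tr -d ',' | grep -oP '(?<=about <b>)[0-9]+')

echo "$count"
```

The lookbehind is what keeps "about <b>" out of the output: it must be present before the match, but it is not part of the match itself.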

Upvotes: 0

sapht

Reputation: 2829

Try something like this instead of grep:

 sed -n 's#.*about <b>\([0-9]*\)</b>.*#\1#p'

-n suppresses the default printing of every input line, and the p flag on the s command prints a line only if a substitution was made on it.
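As a rough illustration of how -n and p work together, the sketch below feeds two lines through the same substitution, keeping tr in front to drop the comma. The sample input and the function name extract_count are hypothetical, chosen only for this example.

```shell
# Hypothetical input: one line without the count, one modeled on the question.
sample='<title>some query - Google Scholar</title>
Results <b>1</b> - <b>10</b> of about <b>8,890</b>.   (<b>0.12</b> sec)'

# extract_count is an illustrative helper name, not part of any tool.
extract_count() {
  # Drop commas, then replace the whole matching line with the captured digits.
  # Using # as the s delimiter avoids escaping the / in </b>.
  tr -d ',' | sed -n 's#.*about <b>\([0-9]*\)</b>.*#\1#p'
}

result=$(printf '%s\n' "$sample" | extract_count)
echo "$result"
```

The first line never matches the pattern, so no substitution happens and, because of -n, nothing is printed for it. Only the line containing the count is substituted and printed.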

Upvotes: 0

Stuart

Reputation: 574

Grep pulls out the right line - use sed after that to chop away what you don't want.

 cat /dir/file | tr -d "," | grep -o -E -- 'about ([^"]+) \w+' |sed -e 's/.*about <b>//' -e 's/<.b>.*//' 

Upvotes: 3
