Reputation: 3
I use wget to fetch the number of papers matching a given query on scholar.google.com, and I get a file containing the whole content of the page.
I want to retrieve the last number in the following part of the file: "Results 1 - 10 of about 8,890."
I tried:
cat /dir/file | tr -d "," | grep -o -E -- 'about ([^"]+) \w+'
but it outputs:
about <b>8890</b>. (<b>0.12</b> sec) </font></td></tr></table></form> <div class
whereas I just want 8890 (without the comma, which tr -d "," takes care of).
Any suggestions on how to improve it? Thank you in advance!
Upvotes: 0
Views: 655
Reputation: 4348
If the HTML tags (<b> and </b>) are present in your file, you'll have to modify your regex to take care of them too. To get just the fragment you're interested in, use a lookbehind assertion. Here's something that should work:
cat /dir/file | tr -d "," | grep -oP -- '(?<=about <b>)[^/<> ]+'
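As a quick check, here is a minimal sketch that runs the same pipeline on a sample line modeled on the output shown in the question. The `sample` and `count` variables are assumptions made for illustration, not the real page content, and `-P` needs a grep built with PCRE support (GNU grep usually has it).

```shell
# Sample line modeled on the question's output (an assumption, not the real page).
sample='Results <b>1</b> - <b>10</b> of about <b>8,890</b>. (<b>0.12</b> sec)'

# Strip commas, then print only the text that follows "about <b>",
# stopping at the first "/", "<", ">" or space.
count=$(printf '%s\n' "$sample" | tr -d ',' | grep -oP -- '(?<=about <b>)[^/<> ]+')

echo "$count"
```

The lookbehind only asserts that "about <b>" comes before the match, so it is not part of what grep -o prints.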
Upvotes: 0
Reputation: 2829
Try something like this instead of grep:
sed -n 's#.*about <b>\([0-9]*\)</b>.*#\1#p'
-n means don't print input lines by default, and the p flag on the s command means print the line if a substitution was made.
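To see how the two options work together, here is a small sketch under the same assumptions as the question: the `sample` line is made up to resemble the page, and the variable names are only illustrative.

```shell
# Sample line modeled on the question's output (an assumption, not the real page).
sample='Results <b>1</b> - <b>10</b> of about <b>8,890</b>. (<b>0.12</b> sec)'

# Strip commas, then replace the whole line with the captured digits.
# -n suppresses default printing; the p flag prints only substituted lines.
count=$(printf '%s\n' "$sample" | tr -d ',' | sed -n 's#.*about <b>\([0-9]*\)</b>.*#\1#p')

# A line without the pattern is not substituted, so nothing is printed for it.
nomatch=$(printf 'no result count here\n' | sed -n 's#.*about <b>\([0-9]*\)</b>.*#\1#p')

echo "count=$count nomatch=$nomatch"
```

Using # as the delimiter avoids having to escape the / in </b>.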
Upvotes: 0
Reputation: 574
Grep pulls out the right line - use sed after that to chop away what you don't want.
cat /dir/file | tr -d "," | grep -o -E -- 'about ([^"]+) \w+' |sed -e 's/.*about <b>//' -e 's/<.b>.*//'
Upvotes: 3