Reputation: 111
The output of "curl -s http://122.160.230.125:8080/gbod/gb_on_demand.do | head -115 | tail -3" is the following:
<li>Balance quota: 78.26 GB</li>
<li>High speed data limit: 80.0 GB</li>
<li>No. of days left in the current bill cycle: 28</li>
and curl -s http://122.160.230.125:8080/gbod/gb_on_demand.do | head -115 | tail -3 | awk '{gsub (/ /, " "); gsub (/\<li>/, ""); gsub (/\<\/li>/, " "); print}'
gives
Balance quota: 78.26 GB
High speed data limit: 80.0 GB
No. of days left in the current bill cycle: 28
How do I extract only the numeric data from each line? Also, is there a better way to extract that data?
Upvotes: 1
Views: 922
Reputation: 111
As suggested, I tried the following and I got what I was looking for.
import urllib2
import re
from bs4 import BeautifulSoup

url = 'http://122.160.230.125:8080/gbod/gb_on_demand.do'
page = urllib2.urlopen(url).read()
soup = BeautifulSoup(page)

data = []
for li in soup.find_all('li', limit=4):
    somevar = re.search('\d[\d.]+', li.text).group()
    data.append(somevar)

print "DSL Number: ", data[0]
print "Balance: ", data[1], "GB"
print "Limit: ", data[2], "GB"
print "Days Left: ", data[3]
For my project, using this Python script makes more sense than using curl.
Thank you all for the help.
Upvotes: 0
Reputation: 4716
Assuming the response is proper XML, you can use xmlstarlet to get the contents of the <li> elements:
http://xmlstar.sourceforge.net/doc/UG/xmlstarlet-ug.html#d0e270
You will have to get your head around how to define the query, but in my opinion it is worth it, as the knowledge you gain will be helpful in future XML/HTML queries.
There are browser plugins to help you define the CSS selector you need to pick exactly the <li> items you want (instead of assuming they always appear on the same lines). Unfortunately, I cannot find references right now.
From there on, use grep or sed or awk as other advised.
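As a sketch of such a pipeline (the xmlstarlet invocation, the page.xml filename, and the sample lines are assumptions, since the actual markup isn't shown in the question): once xmlstarlet has isolated the <li> text, for example with xmlstarlet sel -t -v '//li' -n page.xml, a single sed keeps only the number on each line:

```shell
# Sample lines standing in for the output of:
#   xmlstarlet sel -t -v '//li' -n page.xml   (hypothetical input file)
printf '%s\n' \
  'Balance quota: 78.26 GB' \
  'High speed data limit: 80.0 GB' \
  'No. of days left in the current bill cycle: 28' |
sed -E 's/[^0-9]*([0-9]+(\.[0-9]+)?).*/\1/'
# prints 78.26, 80.0 and 28, one per line
```

The sed pattern skips everything up to the first digit, captures an integer with an optional decimal part, and discards the rest of the line.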
Upvotes: 1
Reputation: 41456
One way to do it:
curl -s http://122.160.230.125:8080/gbod/gb_on_demand.do | awk -F"[;&<]" 'NR>115-3 && NR<=115 {print $8}'
78.26
80.0
28
PS: if you post the full output of curl -s http://122.160.230.125:8080/gbod/gb_on_demand.do,
we can surely clean this up further.
Upvotes: 1
Reputation: 365747
If you want something less brittle than relying on the fact that the lines you want happen to be on lines 113-115, here's some Python code using BeautifulSoup to do the same thing more nicely.
Without knowing what your source file looks like, I had to make a lot of assumptions. In particular, I'm assuming you want to extract numbers from every <li> tag in the file. If you want to extract numbers only from the <li> tags that have numbers, or only from the <li> tags under a particular <ul> tag with a nice id attribute, or accessible through some simple path from the root, or whatever, the code would be a little different.
import re
import urllib.request

import bs4

url = 'http://122.160.230.125:8080/gbod/gb_on_demand.do'
page = urllib.request.urlopen(url).read()
soup = bs4.BeautifulSoup(page)

for li in soup.find_all('li'):
    print(re.search(r'\d[\d.]+', li.text).group())
Upvotes: 1
Reputation: 365747
Using line counts and regexps to parse HTML is very hacky and very brittle.
But if you want to extend what you're already doing, robustness be damned, all you need is a simple regexp to match numbers:
curl -s http://122.160.230.125:8080/gbod/gb_on_demand.do |
head -115 | tail -3 |
awk '{gsub (/ /, " "); gsub (/\<li>/, ""); gsub (/\<\/li>/, " "); print}' |
grep -o -E -e '[0-9][0-9.]+'
(I can never remember if I've got the flags right to work on all grep variants. That definitely works on BSD grep; if it doesn't work on yours, the flags are -o to print only the match rather than the whole line, -E to use extended regexps instead of basic ones, and of course -e to specify the pattern.)
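To see those flags in isolation, here is the same grep applied to one of the cleaned-up lines from the question:

```shell
# -o prints only the matched text, -E enables extended regexps,
# -e supplies the pattern; together they pull out just the number.
echo 'Balance quota: 78.26 GB' | grep -o -E -e '[0-9][0-9.]+'
# prints 78.26
```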
Upvotes: 1