Reputation: 1866
I am trying to extract some information from this website, namely the line which says:
Scale(Virgo + GA + Shapley): 29 pc/arcsec = 0.029 kpc/arcsec = 1.72 kpc/arcmin = 0.10 Mpc/degree
Everything after the colon varies depending on galname.
I have written some code that uses BeautifulSoup and urllib and returns some information, but I am struggling to reduce the data to just the line I want. How do I extract only that information?
import re
import urllib
from bs4 import BeautifulSoup

galname='M82'
a='http://ned.ipac.caltech.edu/cgi-bin/objsearch?objname='+galname+'&extend'+\
'=no&hconst=73&omegam=0.27&omegav=0.73&corr_z=1&out_csys=Equatorial&out_equinox=J2000.0&obj'+\
'_sort=RA+or+Longitude&of=pre_text&zv_breaker=30000.0&list_limit=5&img_stamp=YES'
print a
f = urllib.urlopen(a)
soup=BeautifulSoup(f)
soup.find_all(text=re.compile('Virgo')) and soup.find_all(text=re.compile('GA')) and soup.find_all(text=re.compile('Shapley'))
Upvotes: 2
Views: 427
Reputation: 473863
Define a regular expression pattern that helps BeautifulSoup find the appropriate text node, then extract the number using a capturing group:
pattern = re.compile(r"D \(Virgo \+ GA \+ Shapley\)\s+:\s+([0-9\.]+)")
print pattern.search(soup.find(text=pattern)).group(1)
Prints 5.92.
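The pattern above targets the D (Virgo + GA + Shapley) line, while the question quotes the Scale line. A minimal sketch of the same idea adapted to the Scale line follows, assuming the page text keeps the layout quoted in the question. The names scale_pattern and parse_scale are made up for illustration, and the sample string is the line from the question rather than live output.

```python
import re

# Sample line copied from the question; on the live page the numbers vary by galaxy.
line = ("Scale(Virgo + GA + Shapley): 29 pc/arcsec = 0.029 kpc/arcsec "
        "= 1.72 kpc/arcmin = 0.10 Mpc/degree")

# Capture everything after the colon on the Scale line (hypothetical pattern name).
scale_pattern = re.compile(r"Scale\s*\(Virgo \+ GA \+ Shapley\)\s*:\s*(.+)")

# Each entry looks like "<number> <unit>/<unit>".
value_pattern = re.compile(r"([0-9.]+)\s*([A-Za-z]+/[A-Za-z]+)")


def parse_scale(text):
    """Return {unit: value} for the Scale line, or None if the line is absent."""
    match = scale_pattern.search(text)
    if match is None:
        return None
    return {unit: float(value)
            for value, unit in value_pattern.findall(match.group(1))}


scales = parse_scale(line)
print(scales)
```

With BeautifulSoup, the same function could be applied to the text node returned by soup.find(text=scale_pattern), provided that call finds a match.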
As an aside, I am usually against using regular expressions to parse HTML. Here, however, this is a plain text search: the pattern does not match opening or closing tags or anything else related to the structure of the HTML. You can therefore apply the pattern directly to the HTML source of the page without involving an HTML parser. Note that the response object f can be read only once, so call f.read() on a fresh response rather than one already passed to BeautifulSoup:
data = f.read()
pattern = re.compile(r"D \(Virgo \+ GA \+ Shapley\)\s+:\s+([0-9\.]+)")
print pattern.search(data).group(1)
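If the line is missing from the page, pattern.search returns None and calling .group(1) on it raises an AttributeError. A minimal sketch that guards against this is shown below; extract_distance is a hypothetical helper name, and the sample string is constructed to fit the pattern rather than copied from the live page.

```python
import re

# Same pattern as in the answer.
distance_pattern = re.compile(r"D \(Virgo \+ GA \+ Shapley\)\s+:\s+([0-9\.]+)")


def extract_distance(page_text):
    """Return the captured number as a float, or None if the line is absent."""
    match = distance_pattern.search(page_text)
    if match is None:
        return None
    return float(match.group(1))


# Hypothetical snippet shaped to fit the pattern; not real page output.
sample = "D (Virgo + GA + Shapley)   :    5.92 Mpc"
print(extract_distance(sample))
print(extract_distance("no such line"))
```

In the answer's setting, page_text would be the string returned by f.read().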
Upvotes: 1