Reputation: 714
A certain page retrieved from a URL, has the following syntax :
<p>
<strong>Name:</strong> Pasan <br/>
<strong>Surname: </strong> Wijesingher <br/>
<strong>Former/AKA Name:</strong> No Former/AKA Name <br/>
<strong>Gender:</strong> Male <br/>
<strong>Language Fluency:</strong> ENGLISH <br/>
</p>
I want to extract the data in Name, Surname etc. (I have to repeat this task for many pages)
For that I tried using the following code:
import urllib2
url = 'http://www.my.lk/details.aspx?view=1&id=%2031'
source = urllib2.urlopen(url)
start = '<p><strong>Given Name:</strong>'
end = '<strong>Surname'
givenName=(source.read().split(start))[1].split(end)[0]
start = 'Surname: </strong>'
end = 'Former/AKA Name'
surname=(source.read().split(start))[1].split(end)[0]
print(givenName)
print(surname)
When I'm calling the source.read.split method only one time it works fine. But when I use it twice it gives a list index out of range error.
Can someone suggest a solution?
Upvotes: 5
Views: 17251
Reputation: 378
You can use HTQL:
page="""
<p>
<strong>Name:</strong> Pasan <br/>
<strong>Surname: </strong> Wijesingher <br/>
<strong>Former/AKA Name:</strong> No Former/AKA Name <br/>
<strong>Gender:</strong> Male <br/>
<strong>Language Fluency:</strong> ENGLISH <br/>
</p>
"""
import htql
print(htql.query(page, "<p>.<strong> {a=:tx; b=:xx} "))
# [('Name:', ' Pasan '),
# ('Surname: ', ' Wijesingher '),
# ('Former/AKA Name:', ' No Former/AKA Name '),
# ('Gender:', ' Male '),
# ('Language Fluency:', ' ENGLISH ')
# ]
Upvotes: 1
Reputation: 19406
You can use BeautifulSoup for parsing the HTML string.
Here is some code you might try,
It is using BeautifulSoup (to get the text made by the html code), then parses the string for extracting the data.
from bs4 import BeautifulSoup as bs
dic = {}
data = \
"""
<p>
<strong>Name:</strong> Pasan <br/>
<strong>Surname: </strong> Wijesingher <br/>
<strong>Former/AKA Name:</strong> No Former/AKA Name <br/>
<strong>Gender:</strong> Male <br/>
<strong>Language Fluency:</strong> ENGLISH <br/>
</p>
"""
soup = bs(data)
# Get the text on the html through BeautifulSoup
text = soup.get_text()
# parsing the text
lines = text.splitlines()
for line in lines:
# check if line has ':', if it doesn't, move to the next line
if line.find(':') == -1:
continue
# split the string at ':'
parts = line.split(':')
# You can add more tests here like
# if len(parts) != 2:
# continue
# stripping whitespace
for i in range(len(parts)):
parts[i] = parts[i].strip()
# adding the vaules to a dictionary
dic[parts[0]] = parts[1]
# printing the data after processing
print '%16s %20s' % (parts[0],parts[1])
A tip:
If you are going to use BeautifulSoup to parse HTML,
You should have certain attributes like class=input
or id=10
, That is, you keep all tags of the same type to be the same id or class.
Update
Well for your comment, see the code below
It applies the tip above, making life (and coding) a lot easier
from bs4 import BeautifulSoup as bs
c_addr = []
id_addr = []
data = \
"""
<h2>Primary Location</h2>
<div class="address" id="10">
<p>
No. 4<br>
Private Drive,<br>
Sri Lanka ON K7L LK <br>
"""
soup = bs(data)
for i in soup.find_all('div'):
# get data using "class" attribute
addr = ""
if i.get("class")[0] == u'address': # unicode string
text = i.get_text()
for line in text.splitlines(): # line-wise
line = line.strip() # remove whitespace
addr += line # add to address string
c_addr.append(addr)
# get data using "id" attribute
addr = ""
if int(i.get("id")) == 10: # integer
text = i.get_text()
# same processing as above
for line in text.splitlines():
line = line.strip()
addr += line
id_addr.append(addr)
print "id_addr"
print id_addr
print "c_addr"
print c_addr
Upvotes: 8
Reputation: 168
If you want to be quick, regexes are more useful for this kind of task. It can be a harsh learning curve at first but regexes will save your butt one day.
Try this code:
# read the whole document into memory
full_source = source.read()
NAME_RE = re.compile('Name:.+?>(.*?)<')
SURNAME_RE = re.compile('Surname:.+?>(.*?)<')
name = NAME_RE.search(full_source, re.MULTILINE).group(1).strip()
surname = SURNAME_RE.search(full_source, re.MULTILINE).group(1).strip()
See here for more info on how to use regexes in python.
A more comprehensive solution would involve parsing the HTML (using a lib like BeautifulSoup), but that can be overkill depending on your particular application.
Upvotes: 1
Reputation: 111
You are calling read() twice. That is the problem. Instead of doing that you want to call read once, store the data in a variable, and use that variable where you were calling read(). Something like this:
fetched_data = source.read()
Then later...
givenName=(fetched_data.split(start))[1].split(end)[0]
and...
surname=(fetched_data.split(start))[1].split(end)[0]
That should work. The reason your code didn't work is because the read() method is reading the content the first time, but after it gets done reading it is looking at the end of the content. The next time you call read() it has no more content remaining and throws an exception.
Check out the docs for urllib2 and methods on file objects
Upvotes: 5