Reputation: 3958
I am trying to extract some contact details from a web page, and I have successfully extracted some of the information using Beautiful Soup.
But I can't extract the rest because the HTML is not properly structured, so I am using regular expressions. I have spent the last couple of hours trying to learn regular expressions and I am kind of stuck.
InstanceBeginEditable name="additional_content"
<h1>Contact details</h1>
<h2>Diploma coordinator</h2>
Mr. Matthew Schultz<br />
<br />
610 Maryhill Drive<br />
Green Bay<br />
WI<br />
United States<br />
54303<br />
Contact by email</a><br />
Phone (1) 920 429 6158
<hr /><br />
I need to extract,
Mr. Matthew Schultz
610 Maryhill Drive Green Bay WI United States 54303
and the phone number. I tried things I found through Google searches, but none of them work (probably because of my limited knowledge). Here is my latest attempt:
con = ""
for content in contactContent.contents:
    con += str(content)
print(con)
address = re.search("Mr.\b[a-zA-Z]", con)
print(str(address))
Sometimes I get None.
Please help guys!
PS. The content is freely available on the net; no copyright is infringed.
Upvotes: 0
Views: 253
Reputation: 951
You asked about doing this with a regex. Assuming you get a new multiline string with this data for each div, you could extract the data like this:
import re
# Raw string so that \s reaches the regex engine unchanged
m = re.search(r'</h2>\s+(.*?)<br />\s+<br />\s+(.*?)<br />\s+(.*?)<br />\s+(.*?)<br />\s+(.*?)<br />\s+(.*?)<br />', con)
if m:
    print(m.groups())
output:
('Mr. Matthew Schultz', '610 Maryhill Drive', 'Green Bay', 'WI', 'United States', '54303')
I see you are off to an OK start with regex. The key is to remember that you generally define a character or group of characters, followed by a quantifier that says how many times it should be repeated. In this case, we start with </h2> followed by \s+, which tells the regex engine we want one or more whitespace characters (which includes newlines). The only other nuance is the next expression, (.*?), which is a lazy capture-all: it grabs as little as possible, stopping when it reaches the next expression, which here is the next <br />.
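To make the lazy capture concrete, here is a small sketch comparing (.*) with (.*?) on a shortened, made-up sample line (the variable names are only for illustration). It also shows one likely reason the pattern in the question returns None: in a normal Python string "\b" is a backspace character, not a word boundary, so regex patterns are safer written as raw strings.

```python
import re

# Hypothetical one-line sample, shortened from the HTML in the question.
sample = "Green Bay<br />WI<br />"

# Greedy: .* grabs as much as it can and backs off to the LAST "<br />".
greedy = re.search(r"(.*)<br />", sample).group(1)

# Lazy: .*? grabs as little as it can and stops at the FIRST "<br />".
lazy = re.search(r"(.*?)<br />", sample).group(1)

print(greedy)  # Green Bay<br />WI
print(lazy)    # Green Bay

# "\b" in a normal string is one backspace character;
# r"\b" is two characters that the regex engine reads as a word boundary.
backspace = "\b"
boundary = r"\b"
print(len(backspace))  # 1
print(len(boundary))   # 2
```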
Edit: you should also be able to simplify the regex by taking advantage of the fact that, after the name, all of the address information is in a uniform format. I played with it a little but didn't get it working, so that would be one way to improve it.
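One possible way to use that uniform format is sketched below. It is not a tested general solution, just an illustration on the sample from the question: first cut out the text between </h2> and "Contact by email", then let re.findall collect every "text followed by <br />" entry. The names block, fields and phone are my own, and the phone pattern assumes the number always sits on a line starting with "Phone".

```python
import re

# Sample copied from the question.
con = '''InstanceBeginEditable name="additional_content"
<h1>Contact details</h1>
<h2>Diploma coordinator</h2>
Mr. Matthew Schultz<br />
<br />
610 Maryhill Drive<br />
Green Bay<br />
WI<br />
United States<br />
54303<br />
Contact by email</a><br />
Phone (1) 920 429 6158
<hr /><br />'''

# Keep only the text between </h2> and the email link.
# re.S lets "." match newlines so the capture can span several lines.
block = re.search(r'</h2>(.*?)Contact by email', con, re.S).group(1)

# Each field is "some text, then <br />". Excluding "<", ">" and leading
# whitespace keeps the empty "<br />" line from producing a bogus entry.
fields = re.findall(r'\s*([^<>\s][^<>]*?)\s*<br />', block)
print(fields)

# Assumption: the phone number is whatever follows "Phone" on that line.
phone = re.search(r'Phone\s+(.+)', con).group(1)
print(phone)
```

On this sample, fields holds the name followed by the five address parts, and phone holds the text after "Phone".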
Upvotes: 1
Reputation: 10923
OK, using your data (edited to embed the parsing routine inside a function):
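def parse_list(source):
    # Join all lines into one string so the fields can be split on <br />
    lines = ''.join(source.split('\n'))
    # Keep only the text between the closing </h2> and the email link
    lines = lines[lines.find('</h2>') + len('</h2>') : lines.find('Contact by email')]
    # Split on <br /> and drop empty entries
    lines = [line.strip()
             for line in lines.split('<br />')
             if line.strip() != '']
    return lines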
# Parse the page and retrieve contact string from the relevant <div>
con = ''' InstanceBeginEditable name="additional_content"
<h1>Contact details</h1>
<h2>Diploma coordinator</h2>
Mr. Matthew Schultz<br />
<br />
610 Maryhill Drive<br />
Green Bay<br />
WI<br />
United States<br />
54303<br />
Contact by email</a><br />
Phone (1) 920 429 6158
<hr /><br />'''
# Extract details and print to console
details = parse_list(con)
print(details)
This will output a list:
['Mr. Matthew Schultz', '610 Maryhill Drive', 'Green Bay', 'WI', 'United States', '54303']
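To get from that list to the three values asked for in the question, something like the following should do. It is only a sketch: it assumes the first entry is always the name and the rest is the address, and that the phone number follows the word "Phone" up to the next tag. The names name, address, tail_html and phone are illustrative.

```python
# The list as printed above.
details = ['Mr. Matthew Schultz', '610 Maryhill Drive', 'Green Bay',
           'WI', 'United States', '54303']

# Assumption: the first entry is the name, the rest is the address.
name = details[0]
address = ' '.join(details[1:])

# parse_list stops before the phone line, so read it separately.
# tail_html is the end of the sample from the question.
tail_html = 'Contact by email</a><br />\nPhone (1) 920 429 6158\n<hr /><br />'
after_phone = tail_html[tail_html.find('Phone') + len('Phone'):]
# Everything up to the next tag, without surrounding whitespace.
phone = after_phone.split('<')[0].strip()

print(name)     # Mr. Matthew Schultz
print(address)  # 610 Maryhill Drive Green Bay WI United States 54303
print(phone)    # (1) 920 429 6158
```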
Upvotes: 1