bobsr

Reputation: 3965

Python string split

What would be the best way to split this in Python into address, city, state, and zip?

<div class="adtxt">7616 W Belmont Ave<br />Chicago, IL 60634-3225</div>

In some cases the zip code has only five digits:

 <div class="adtxt">7616 W Belmont Ave<br />Chicago, IL 60634</div>

Upvotes: 2

Views: 843

Answers (3)

SiggyF

Reputation: 23215

Combining BeautifulSoup with regular expressions should give you something like:

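import re
from bs4 import BeautifulSoup  # bs4 is the current package name for Beautiful Soup

thestring = r'<div class="adtxt">7616 W Belmont Ave<br />Chicago, IL 60634-3225</div>'
re0 = re.compile(r'(?P<address>[^<]+)')
# The "-1234" suffix is optional, so a plain five-digit zip also matches.
re1 = re.compile(r'(?P<city>[^,]+), (?P<state>\w\w) (?P<zip>\d{5}(?:-\d{4})?)')
soup = BeautifulSoup(thestring, 'html.parser')
# contents[0] is the text before <br />, contents[2] is the text after it.
(address,) = re0.search(soup.div.contents[0]).groups()
city, state, zipcode = re1.search(soup.div.contents[2]).groups()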

Upvotes: 0

thevilledev

Reputation: 2397

Just a hint: there are much better ways to parse HTML than regular expressions, for example Beautiful Soup.

Here's why you shouldn't do that with regular expressions.

EDIT: Oh well, @teepark linked it first. :)
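
To make the hint concrete, here is a minimal sketch of the parser-based approach. It uses the standard library's html.parser as a stand-in for Beautiful Soup so that it runs without extra installs; the names TextChunks, CITY_LINE and split_address are made up for this example, and it assumes the div always holds exactly two text lines separated by a <br />.

```python
import re
from html.parser import HTMLParser


class TextChunks(HTMLParser):
    """Collect the non-empty text nodes of a fragment, in document order."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        data = data.strip()
        if data:
            self.chunks.append(data)


# City, two-letter state, five-digit zip with an optional "-1234" suffix.
CITY_LINE = re.compile(
    r'(?P<city>[^,]+),\s*(?P<state>[A-Z]{2})\s+(?P<zip>\d{5}(?:-\d{4})?)$'
)


def split_address(html):
    """Return (address, city, state, zip) for one adtxt fragment."""
    parser = TextChunks()
    parser.feed(html)
    parser.close()
    # Assumption: street line first, then the "City, ST zip" line.
    if len(parser.chunks) != 2:
        raise ValueError('Expected two text lines in %r' % html)
    street, city_line = parser.chunks
    mo = CITY_LINE.match(city_line)
    if mo is None:
        raise ValueError('No match for %r' % city_line)
    return street, mo.group('city'), mo.group('state'), mo.group('zip')


print(split_address(
    '<div class="adtxt">7616 W Belmont Ave<br />Chicago, IL 60634</div>'
))
```

The parser only separates the markup from the text; the regular expression is then applied to plain text, which is the part regular expressions are good at.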

Upvotes: 0

Alex Martelli

Reputation: 882691

Depending on how tight or lax you want to be on various aspects that can't be deduced from a single example, something like the following should work:

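import re

thestring = '<div class="adtxt">7616 W Belmont Ave<br />Chicago, IL 60634-3225</div>'
# The "-1234" suffix is optional, so a plain five-digit zip also matches.
s = re.compile(r'^<div.*?>([^<]+)<br.*?>([^,]+), (\w\w) (\d{5}(?:-\d{4})?)</div>$')
mo = s.match(thestring)
if mo is None:
    raise ValueError('No match for %r' % thestring)
address, city, state, zipcode = mo.groups()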
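
As a rough sketch of the "lax" end of that spectrum, the same idea can be wrapped in a function and run against both inputs from the question. The names ADDRESS_RE and parse_address are made up for this example, and the pattern assumes a two-character state and a five-digit zip with an optional four-digit suffix; it also tolerates surrounding whitespace, which is an assumption about the real data.

```python
import re

# One pattern for the whole fragment; named groups keep the result readable.
ADDRESS_RE = re.compile(
    r'^\s*<div[^>]*>(?P<address>[^<]+)<br\s*/?>'
    r'(?P<city>[^,]+),\s*(?P<state>\w\w)\s+'
    r'(?P<zip>\d{5}(?:-\d{4})?)</div>\s*$'
)


def parse_address(fragment):
    """Return a dict with address, city, state and zip, or raise ValueError."""
    mo = ADDRESS_RE.match(fragment)
    if mo is None:
        raise ValueError('No match for %r' % fragment)
    return mo.groupdict()


samples = [
    '<div class="adtxt">7616 W Belmont Ave<br />Chicago, IL 60634-3225</div>',
    ' <div class="adtxt">7616 W Belmont Ave<br />Chicago, IL 60634</div>',
]
for sample in samples:
    print(parse_address(sample))
```

Raising on a failed match, as in the snippet above, makes it obvious when a page uses a layout the pattern was not written for.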

Upvotes: 3
