Reputation: 105
I am very new to Python but seem to be getting along. I am writing a web crawler in Python.
I've got the crawler working using the Beautiful Soup library and want to find the best library for parsing or splitting an address into its constituent parts.
Here is a sample of the text to be parsed.
['\r\n\t \t\t \t25 Stockwood Road', <br/>, 'Asheville, NC 28803', <br/>, '\t (828) 505-1638\t \t']
I understand it's a list and I can figure out how to remove the control characters.
Since I'm so new, I'd like recommendations on what libraries are being used for this, along with their Python version, OS and prerequisites.
I'd like to figure out the code for myself, but if you're inclined to offer a sample, I wouldn't argue. :)
Upvotes: 0
Views: 4386
Reputation: 96
You can try the Python library usaddress (there's also a web interface for trying it out).
It parses addresses probabilistically, and is much more robust than regex-based parsers when dealing with messy addresses.
Upvotes: 1
Reputation: 3676
A list comprehension is pretty sleek for something like this. Also look into str.strip(): it cleans up the tabs, newlines and spaces. It won't remove the <br/> elements on its own though, and those are not plain strings, so keep only the string items while stripping.
out = [x.strip() for x in lst if isinstance(x, str)]
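To make that concrete, here is a minimal sketch using only the standard library. It assumes the list always has the three-part layout from the question (street, "City, ST ZIP", phone). `FakeTag`, `clean_parts` and `CITY_STATE_ZIP` are names made up for illustration; `FakeTag` only stands in for the `<br/>` Tag objects Beautiful Soup would return. The regex split is deliberately naive and will not cope with messy addresses.

```python
import re


class FakeTag:
    """Hypothetical stand-in for the <br/> Tag objects in the scraped list."""


# Sample data shaped like the list in the question.
lst = ['\r\n\t \t\t \t25 Stockwood Road', FakeTag(),
       'Asheville, NC 28803', FakeTag(), '\t (828) 505-1638\t \t']


def clean_parts(items):
    """Keep only non-empty string items, with surrounding whitespace stripped."""
    return [x.strip() for x in items if isinstance(x, str) and x.strip()]


parts = clean_parts(lst)

# Assumption: exactly three string parts, in this order.
street, city_line, phone = parts

# Naive pattern for "City, ST ZIP" (US-style, optional ZIP+4).
CITY_STATE_ZIP = re.compile(
    r'^(?P<city>[^,]+),\s*(?P<state>[A-Z]{2})\s+(?P<zip>\d{5}(?:-\d{4})?)$'
)

address = {'street': street, 'phone': phone}
match = CITY_STATE_ZIP.match(city_line)
if match:
    # Adds the 'city', 'state' and 'zip' keys when the line fits the pattern.
    address.update(match.groupdict())

print(address)
```

If the pages you crawl vary in layout, the fixed three-way unpacking and the pattern are the first things that will break, which is where a dedicated address parser earns its keep.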
Upvotes: 0