Reputation: 405
I am using the library usaddress to parse addresses from a set of files I have. I would like my final output to be a data frame where the column names represent parts of the address (e.g. street, city, state) and each row represents an individual address I've extracted. For example:
Suppose I have a list of addresses:
addr = ['123 Pennsylvania Ave NW Washington DC 20008',
'652 Polk St San Francisco, CA 94102',
'3711 Travis St #800 Houston, TX 77002']
and I parse them using usaddress:
import usaddress
info = [usaddress.parse(loc) for loc in addr]
"info" is a list of lists of tuples that looks like this:
[[('123', 'AddressNumber'),
('Pennsylvania', 'StreetName'),
('Ave', 'StreetNamePostType'),
('NW', 'StreetNamePostDirectional'),
('Washington', 'PlaceName'),
('DC', 'StateName'),
('20008', 'ZipCode')],
[('652', 'AddressNumber'),
('Polk', 'StreetName'),
('St', 'StreetNamePostType'),
('San', 'PlaceName'),
('Francisco,', 'PlaceName'),
('CA', 'StateName'),
('94102', 'ZipCode')],
[('3711', 'AddressNumber'),
('Travis', 'StreetName'),
('St', 'StreetNamePostType'),
('#', 'OccupancyIdentifier'),
('800', 'OccupancyIdentifier'),
('Houston,', 'PlaceName'),
I would like each inner list (there are 3 lists within the object "info") to represent a row, with the second element of each tuple giving the column name and the first element giving the value. Note: the length of the inner lists will not always be the same, as not every address will have every piece of information.
Any help would be much appreciated!
Thanks
Upvotes: 3
Views: 1127
Reputation: 405
Thank you for your responses! I ended up doing a completely different workaround as follows:
I checked the documentation to see all possible parse tags from usaddress, created a DataFrame with all possible tags as columns plus one other column holding the extracted addresses, and then extracted the information for each column using regex. Code below!
import re

import pandas as pd
import usaddress

parse_tags = ['Recipient','AddressNumber','AddressNumberPrefix','AddressNumberSuffix',
'StreetName','StreetNamePreDirectional','StreetNamePreModifier','StreetNamePreType',
'StreetNamePostDirectional','StreetNamePostModifier','StreetNamePostType','CornerOf',
'IntersectionSeparator','LandmarkName','USPSBoxGroupID','USPSBoxGroupType','USPSBoxID',
'USPSBoxType','BuildingName','OccupancyType','OccupancyIdentifier','SubaddressIdentifier',
'SubaddressType','PlaceName','StateName','ZipCode']
addr = ['123 Pennsylvania Ave NW Washington DC 20008',
'652 Polk St San Francisco, CA 94102',
'3711 Travis St #800 Houston, TX 77002']
df = pd.DataFrame({'Addresses': addr})
# pd.concat returns a new DataFrame, so the result has to be assigned back to df
df = pd.concat([df, pd.DataFrame(columns=parse_tags)])
Then I created a new column called "Info" that holds the usaddress parse list converted to a string:
df['Info'] = df['Addresses'].apply(lambda x: str(usaddress.parse(x)))
Now here's the major workaround. I looped through each column name and looked for it in the corresponding "Info" cell and applied regular expressions to extract information where they existed!
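for colname in parse_tags:
    # Match the exact tag only: a plain substring search for 'StreetName' would
    # also hit 'StreetNamePostType' and make the [0] lookup fail on an empty list.
    pattern = r"\('(\S+)', '{}'\)".format(colname)
    df[colname] = df['Info'].apply(lambda x: (re.findall(pattern, x) or [""])[0])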
This is probably not the most efficient way, but it worked for my purposes. Thanks everyone for providing suggestions!
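For comparison, the same per-tag lookup can probably be done without converting the parse result to a string first, since usaddress.parse already returns (token, label) pairs. Below is a minimal sketch, not a drop-in replacement: `parsed` is a hard-coded stand-in for the parse output shown in the question (so the snippet runs without usaddress installed), `tags` is a shortened version of parse_tags, and `first_by_label` is a hypothetical helper name. Like the `[0]` in the loop above, it keeps only the first token for each label.

```python
# Shortened tag list for illustration; the full parse_tags list works the same way.
tags = ['AddressNumber', 'StreetName', 'StreetNamePostType',
        'PlaceName', 'StateName', 'ZipCode']

# Stand-in for usaddress.parse('652 Polk St San Francisco, CA 94102'),
# copied from the output shown in the question.
parsed = [('652', 'AddressNumber'), ('Polk', 'StreetName'),
          ('St', 'StreetNamePostType'), ('San', 'PlaceName'),
          ('Francisco,', 'PlaceName'), ('CA', 'StateName'),
          ('94102', 'ZipCode')]


def first_by_label(pairs, columns):
    """Return {column: first token with that label, or '' if absent}."""
    row = {col: '' for col in columns}
    for token, label in pairs:
        # Ignore labels outside the chosen columns; keep the first hit only.
        if label in row and row[label] == '':
            row[label] = token
    return row


row = first_by_label(parsed, tags)
```

A list of such dicts, one per address, can then be passed to pd.DataFrame(rows, columns=tags) to get the same layout as the regex version.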
Upvotes: 1
Reputation: 40918
Not sure if there is a DataFrame constructor that can handle info exactly as you have it now. (Maybe from_records or from_items? I still don't think this structure would be directly compatible.)
Here's a bit of manipulation to get what you're looking for:
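import pandas as pd

# Values are assigned to columns by position, not by label, so this assumes
# each address parses into the same sequence of labels as info[0].
cols = [j for _, j in info[0]]
# Could use nested list comprehension here, but this is probably
# more readable.
info2 = []
for row in info:
    info2.append([i for i, _ in row])
pd.DataFrame(info2, columns=cols)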
   AddressNumber    StreetName  StreetNamePostType  StreetNamePostDirectional   PlaceName  StateName  ZipCode
0            123  Pennsylvania                 Ave                         NW  Washington         DC    20008
1            652          Polk                  St                        San  Francisco,         CA    94102
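Note that in the output above "San" lands under StreetNamePostDirectional, because values are placed by position rather than by label. One way around that, sketched here under a few assumptions, is to turn each parsed address into a dict keyed by label and join tokens that share a label. The `info` below is hard-coded from the first two addresses in the question so the snippet runs without usaddress; `rows_from_parsed` is a hypothetical helper name, and stripping trailing commas is my own choice, not something usaddress does.

```python
from collections import defaultdict

# First two parsed addresses, copied from the question.
info = [
    [('123', 'AddressNumber'), ('Pennsylvania', 'StreetName'),
     ('Ave', 'StreetNamePostType'), ('NW', 'StreetNamePostDirectional'),
     ('Washington', 'PlaceName'), ('DC', 'StateName'), ('20008', 'ZipCode')],
    [('652', 'AddressNumber'), ('Polk', 'StreetName'),
     ('St', 'StreetNamePostType'), ('San', 'PlaceName'),
     ('Francisco,', 'PlaceName'), ('CA', 'StateName'), ('94102', 'ZipCode')],
]


def rows_from_parsed(parsed_addresses):
    """Build one {label: value} dict per address, joining repeated labels."""
    rows = []
    for pairs in parsed_addresses:
        grouped = defaultdict(list)
        for token, label in pairs:
            # Assumption: trailing commas are punctuation, not data.
            grouped[label].append(token.rstrip(','))
        rows.append({label: ' '.join(tokens)
                     for label, tokens in grouped.items()})
    return rows


rows = rows_from_parsed(info)
```

Passing rows to pd.DataFrame(rows) should give one column per label that occurs anywhere, with NaN where an address lacks that part, so inner lists of different lengths are not a problem.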
Upvotes: 1