Alison LT

Reputation: 405

Converting unordered list of tuples to pandas DataFrame

I am using the library usaddress to parse addresses from a set of files I have. I would like my final output to be a data frame where column names represent parts of the address (e.g. street, city, state) and rows represent each individual address I've extracted. For example:

Suppose I have a list of addresses:

addr = ['123 Pennsylvania Ave NW Washington DC 20008', 
        '652 Polk St San Francisco, CA 94102', 
        '3711 Travis St #800 Houston, TX 77002']

and I parse them with usaddress:

info = [usaddress.parse(loc) for loc in addr]

"info" is a list of lists of tuples that looks like this:

[[('123', 'AddressNumber'),
  ('Pennsylvania', 'StreetName'),
  ('Ave', 'StreetNamePostType'),
  ('NW', 'StreetNamePostDirectional'),
  ('Washington', 'PlaceName'),
  ('DC', 'StateName'),
  ('20008', 'ZipCode')],
 [('652', 'AddressNumber'),
  ('Polk', 'StreetName'),
  ('St', 'StreetNamePostType'),
  ('San', 'PlaceName'),
  ('Francisco,', 'PlaceName'),
  ('CA', 'StateName'),
  ('94102', 'ZipCode')],
 [('3711', 'AddressNumber'),
  ('Travis', 'StreetName'),
  ('St', 'StreetNamePostType'),
  ('#', 'OccupancyIdentifier'),
  ('800', 'OccupancyIdentifier'),
  ('Houston,', 'PlaceName'),

I would like each inner list (there are 3 within the object "info") to represent a row, with the second value of each tuple naming the column and the first value of the tuple supplying the cell value. Note: the length of the inner lists will not always be the same, as not every address will have every piece of information.

Any help would be much appreciated!

Thanks

Upvotes: 3

Views: 1127

Answers (2)

Alison LT

Reputation: 405

Thank you for your responses! I ended up doing a completely different workaround as follows:

I checked the documentation to see all possible parse_tags from usaddress, created a DataFrame with all possible tags as columns, and one other column with the extracted addresses. Then I proceeded to parse and extract information from the columns using regex. Code below!

import re
import pandas as pd
import usaddress

parse_tags = ['Recipient','AddressNumber','AddressNumberPrefix','AddressNumberSuffix',
'StreetName','StreetNamePreDirectional','StreetNamePreModifier','StreetNamePreType',
'StreetNamePostDirectional','StreetNamePostModifier','StreetNamePostType','CornerOf',
'IntersectionSeparator','LandmarkName','USPSBoxGroupID','USPSBoxGroupType','USPSBoxID',
'USPSBoxType','BuildingName','OccupancyType','OccupancyIdentifier','SubaddressIdentifier',
'SubaddressType','PlaceName','StateName','ZipCode']

addr = ['123 Pennsylvania Ave NW Washington DC 20008', 
        '652 Polk St San Francisco, CA 94102', 
        '3711 Travis St #800 Houston, TX 77002']

df = pd.DataFrame({'Addresses': addr})
# pd.concat returns a new DataFrame, so the result has to be assigned back.
df = pd.concat([df, pd.DataFrame(columns=parse_tags)])

Then I created a new column, "Info", holding the string form of each usaddress parse list:

df['Info'] = df['Addresses'].apply(lambda x: str(usaddress.parse(x)))

Now here's the major workaround. I looped through each column name and looked for it in the corresponding "Info" cell and applied regular expressions to extract information where they existed!

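for colname in parse_tags:
    # Raw string; matching the quoted tag avoids an IndexError when colname
    # is only a prefix of another tag (e.g. StreetName vs StreetNamePostType).
    pattern = re.compile(r"\('(\S+)', '{}'\)".format(colname))
    df[colname] = df['Info'].apply(
        lambda x: (pattern.findall(x) or [""])[0])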

This is probably not the most efficient way, but it worked for my purposes. Thanks everyone for providing suggestions!
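
For comparison, here is a minimal sketch of a way to skip the string and regex round-trip: collapse each parsed address into a dict keyed by tag, then hand the list of dicts to pandas. The helper name to_record, and the choice to join repeated tags with a space and strip trailing commas, are assumptions made for illustration and not part of usaddress. The sample data is copied from the question so the snippet runs without usaddress installed.

```python
from collections import defaultdict

# Parsed output in the shape shown in the question: one list of
# (token, tag) tuples per address. Hard-coded here (first two addresses
# only) so the sketch does not depend on usaddress.
info = [
    [('123', 'AddressNumber'), ('Pennsylvania', 'StreetName'),
     ('Ave', 'StreetNamePostType'), ('NW', 'StreetNamePostDirectional'),
     ('Washington', 'PlaceName'), ('DC', 'StateName'), ('20008', 'ZipCode')],
    [('652', 'AddressNumber'), ('Polk', 'StreetName'),
     ('St', 'StreetNamePostType'), ('San', 'PlaceName'),
     ('Francisco,', 'PlaceName'), ('CA', 'StateName'), ('94102', 'ZipCode')],
]


def to_record(parsed):
    """Collapse one parsed address into a {tag: value} dict (hypothetical helper).

    Tokens that share a tag (e.g. 'San' and 'Francisco,') are joined
    with a space; trailing commas are stripped from each token.
    """
    grouped = defaultdict(list)
    for token, tag in parsed:
        grouped[tag].append(token.rstrip(','))
    return {tag: ' '.join(tokens) for tag, tokens in grouped.items()}


# One dict per address; tags an address lacks are simply absent.
records = [to_record(parsed) for parsed in info]
```

Passing records to pd.DataFrame(records) should then give one row per address, with NaN wherever an address lacks a tag.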

Upvotes: 1

Brad Solomon

Reputation: 40918

Not sure if there is a DataFrame constructor that can handle info exactly as you have it now. (Maybe from_records or from_items? I still don't think this structure would be directly compatible.)

Here's a bit of manipulation to get what you're looking for:

cols = [j for _, j in info[0]]

# Could use nested list comprehension here, but this is probably
#     more readable.
info2 = []
for row in info:
    info2.append([i for i, _ in row])

pd.DataFrame(info2, columns=cols)

  AddressNumber    StreetName StreetNamePostType StreetNamePostDirectional   PlaceName StateName ZipCode
0           123  Pennsylvania                Ave                        NW  Washington        DC   20008
1           652          Polk                 St                       San  Francisco,        CA   94102
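
Note that the snippet above takes its column names from the first address only and fills rows by position, which is why 'San' lands under StreetNamePostDirectional in the second row. A rough sketch of one way to align values by tag instead, assuming repeated tags should be joined with a space (that choice, and the by_tag name, are illustrative only):

```python
# Two parsed addresses in the question's (token, tag) format.
info = [
    [('123', 'AddressNumber'), ('Pennsylvania', 'StreetName'),
     ('Ave', 'StreetNamePostType'), ('NW', 'StreetNamePostDirectional'),
     ('Washington', 'PlaceName'), ('DC', 'StateName'), ('20008', 'ZipCode')],
    [('652', 'AddressNumber'), ('Polk', 'StreetName'),
     ('St', 'StreetNamePostType'), ('San', 'PlaceName'),
     ('Francisco,', 'PlaceName'), ('CA', 'StateName'), ('94102', 'ZipCode')],
]

# Column order: every tag seen in any row, in first-seen order.
cols = []
for row in info:
    for _, tag in row:
        if tag not in cols:
            cols.append(tag)

# One output row per address, aligned to cols; None marks a missing tag.
info2 = []
for row in info:
    by_tag = {}
    for token, tag in row:
        # Join tokens that repeat a tag, e.g. 'San' + 'Francisco,'.
        by_tag[tag] = by_tag[tag] + ' ' + token if tag in by_tag else token
    info2.append([by_tag.get(tag) for tag in cols])
```

With these inputs, pd.DataFrame(info2, columns=cols) should put 'San Francisco,' under PlaceName and leave StreetNamePostDirectional empty for the second address.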

Upvotes: 1
