goldisfine

Reputation: 4850

Extracting name from line

I have data in the following format:

Bxxxx, Mxxxx F  Birmingham   AL (123) 555-2281  NCC Clinical Mental Health, Counselor Education, Sexual Abuse Recovery, Depression/Grief/Chronically or Terminally Ill, Mental Health/Agency Counseling English 99.52029    -99.8115
Axxxx, Axxxx Brown  Birmingham   AL (123) 555-2281  NCC Clinical Mental Health, Depression/Grief/Chronically or Terminally Ill, Mental Health/Agency Counseling English 99.52029    -99.8115
Axxxx, Bxxxx    Mobile   AL (123) 555-8011  NCC Childhood & Adolescence, Clinical Mental Health, Sexual Abuse Recovery, Disaster Counseling English 99.68639    -99.053238
Axxxx, Rxxxx Lunsford   Athens   AL (123) 555-8119  NCC, NCCC, NCSC Career Development, Childhood & Adolescence, School, Disaster Counseling, Supervision   English 99.804501   -99.971283
Axxxx, Mxxxx    Mobile   AL (123) 555-5963  NCC Clinical Mental Health, Counselor Education, Depression/Grief/Chronically or Terminally Ill, Mental Health/Agency Counseling, Supervision   English 99.68639    -99.053238
Axxxx, Txxxx    Mountain Brook   AL (123) 555-3099  NCC Addictions and Dependency, Career Development, Childhood & Adolescence, Corrections/Offenders, Sexual Abuse Recovery    English 99.50214    -99.75557
Axxxx, Lxxxx    Birmingham   AL (123) 555-4550  NCC Addictions and Dependency, Eating Disorders English 99.52029    -99.8115
Axxxx, Wxxxx    Birmingham   AL (123) 555-2328  NCC     English 99.52029    -99.8115
Axxxx, Rxxxx    Mobile   AL (123) 555-9411  NCC Addictions and Dependency, Childhood & Adolescence, Couples & Family, Sexual Abuse Recovery, Depression/Grief/Chronically or Terminally Ill English 99.68639    -99.053238

I need to extract only the person names. Ideally, I'd be able to use humanName to get name objects with fields such as name.first, name.middle, name.last, and name.title.

I've tried iterating through each line until I hit the first two consecutive capital letters (the state), storing everything before that in a list, and then calling humanName on it, but that was a disaster. I don't want to keep pursuing this approach.

Is there a way to sense the starts and ends of words? That might be helpful...

Recommendations?

Upvotes: 2

Views: 218

Answers (2)

Jordan

Reputation: 32522

Your best bet is to find a different data source. Seriously. This one is farked.

If you can't do that, then I would do some work like this:

  1. Replace all double spaces with single spaces.
  2. Split the line by spaces.
  3. Take the last 2 items in the list. Those are the latitude and longitude.
  4. Looping backwards in the list, do a lookup of each item into a list of potential languages. If the lookup fails, you are done with languages.
  5. Join the remaining list items back with spaces
  6. In the line, find the first opening paren. Read 14 characters from there (the length of a phone number like "(123) 555-2281"), replace all punctuation with empty strings, and reformat it as a normal phone number.
  7. Split the remainder of the line after the phone number by commas.
  8. Using that split, loop through each item in the list. If the text starts with more than 1 capital letter, add it to certifications. Otherwise, add it to areas of practice.
  9. Going back to the index you found in step #6, get the line up until then. Split it on spaces, and take the last item. That's the state. All that's left is name and city!
  10. Take the first 2 items in the space-split line. That's your best guess for name, so far.
  11. Look at the 3rd item. If it is a single letter, add it to the name and remove it from the list.
  12. Download US.zip from here: http://download.geonames.org/export/zip/US.zip
  13. In the US data file, split each line on tabs. Take the data at indexes 2 and 4, which are city name and state abbreviation. Loop through all rows and insert each one, concatenated as abbreviation + ":" + city name (e.g. AK:Sand Point), into a new list.
  14. Make a combination of all possible joins of the remaining items in your line, in the same format as in step #13. So you'd end up with AL:Brown Birmingham and AL:Birmingham for the 2nd line.
  15. Loop through each combination and search for it in the list you created in step #13. If you find it, that is the city, so remove it from the split list.
  16. Add all remaining items in the string-split list to the person's name.
  17. If desired, split the name on the comma. index[0] is the last name; index[1] is all remaining names. Don't make any assumptions about middle names.
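
Steps 14 and 15 are the least obvious part, so here is a minimal sketch of them in isolation. It assumes the city always sits directly before the state, so only trailing runs of words are tried as candidates, which matches the example in step 14. The function name split_name_and_city and the three-entry known_cities set are hypothetical stand-ins for the lookup built from US.txt in step 13.

```python
def split_name_and_city(words, state, known_cities):
    """Split leftover words into (extra name words, city).

    Tries the longest trailing run of words first, so "Mountain Brook"
    is tested before "Brook".
    """
    for start in range(len(words)):
        candidate = " ".join(words[start:])
        if "{}:{}".format(state, candidate) in known_cities:
            return words[:start], candidate
    # No known city found: treat everything as part of the name
    return list(words), ""


# Hypothetical stand-in for the state:city keys built from US.txt
known_cities = {"AL:Birmingham", "AL:Mountain Brook", "AL:Athens"}

extra, city = split_name_and_city(["Brown", "Birmingham"], "AL", known_cities)
print(extra, city)
```

Whatever comes back in the first element ("Brown" here) belongs to the person's name, as in step 16.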

Just for giggles, I implemented this. Enjoy.

import itertools

# this list of languages could be longer and should read from a file
languages = ["English", "Spanish", "Italian", "Japanese", "French",
             "Standard Chinese", "Chinese", "Hindi", "Standard Arabic", "Russian"]

languages = [language.lower() for language in languages]

# Loop through US.txt and format it. Download from geonames.org.
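cities = set()  # a set keeps the state:city membership checks below fast
with open('US.txt', 'r') as us_data:
    for line in us_data:
        line_split = line.split("\t")
        # index 4 is the state abbreviation, index 2 is the city name
        cities.add("{}:{}".format(line_split[4], line_split[2]))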

# This is the dataset
with open('state-teachers.txt', 'r') as teachers:
    next(teachers)  # skip header

    for line in teachers:
        # Replace all double spaces with single spaces
        while line.find("  ") != -1:
            line = line.replace("  ", " ")

        line_split = line.split(" ")

        # Lat/Lon are the last 2 items
        longitude = line_split.pop().strip()
        latitude = line_split.pop().strip()

        # Search for potential languages and trim off the line as we find them
        teacher_languages = []

        while True:
            language_check = line_split[-1]
            if language_check.lower().replace(",", "").strip() in languages:
                teacher_languages.append(language_check)
                del line_split[-1]
            else:
                break

        # Rejoin everything and then use phone number as the special key to split on
        line = " ".join(line_split)

        phone_start = line.find("(")
        phone = line[phone_start:phone_start+14].strip()

        after_phone = line[phone_start+15:]

        # Certifications can be recognized as acronyms
        # Anything else is assumed to be an area of practice
        certifications = []
        areas_of_practice = []

        specialties = after_phone.split(",")
        for specialty in specialties:
            specialty = specialty.strip()
            if not specialty:
                continue  # skip empty pieces, e.g. when nothing follows the phone number
            if specialty[0:2].upper() == specialty[0:2]:
                certifications.append(specialty)
            else:
                areas_of_practice.append(specialty)

        before_phone = line[0:phone_start-1]
        line_split = before_phone.split(" ")

        # State is the last column before phone
        state = line_split.pop()

        # Name should be the first 2 columns, at least. This is a basic guess.
        name = line_split[0] + " " + line_split[1]

        line_split = line_split[2:]

        # Add initials
        if len(line_split[0].strip()) == 1:
            name += " " + line_split[0].strip()
            line_split = line_split[1:]

        # Candidate cities: every trailing run of the remaining words, longest first.
        # The city sits right before the state, so "Brown Birmingham" is tried before "Birmingham".
        combos = [" ".join(line_split[i:]) for i in range(len(line_split))]

        line = " ".join(line_split)
        city = ""

        # See if the state:city combo is valid. If so, set it and let everything else be the name
        for combo in combos:
            if "{}:{}".format(state, combo) in cities:
                city = combo
                line = line.replace(combo, "")
                break

        # Remaining data must be a name
        if line.strip() != "":
            name += " " + line

        # Clean up names
        last_name, first_name = [piece.strip() for piece in name.split(",")]

        # format() keeps the output identical on Python 2 and Python 3
        print("{} {}".format(first_name, last_name))
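
As a variation on steps 6 and 9, a regular expression can treat the state abbreviation and phone number as a single anchor and return everything before them. This is only a sketch, assuming every line has a two-letter state followed by a phone number in the "(123) 555-2281" form. The pattern and the split_on_phone name are my own, not part of the code above.

```python
import re

# name+city prefix, two-letter state, then a "(123) 555-2281" style phone
LINE_RE = re.compile(
    r"^(?P<prefix>.+?)\s+(?P<state>[A-Z]{2})\s+"
    r"(?P<phone>\(\d{3}\)\s*\d{3}-\d{4})\s*(?P<rest>.*)$"
)


def split_on_phone(line):
    """Return (prefix, state, phone, rest), or None if the line does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    # Collapse runs of whitespace inside the name+city prefix
    prefix = " ".join(match.group("prefix").split())
    return prefix, match.group("state"), match.group("phone"), match.group("rest")


sample = "Axxxx, Txxxx    Mountain Brook   AL (123) 555-3099  NCC Addictions and Dependency"
print(split_on_phone(sample))
```

The prefix still mixes name and city ("Axxxx, Txxxx Mountain Brook" for the sample), so a city lookup is still needed to separate the two.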

Upvotes: 1

brechin

Reputation: 589

Not a code answer, but it looks like you could get most/all of the data you're after from the licensing board at http://www.abec.alabama.gov/rostersearch2.asp?search=%25&submit1=Search. Names are easy to get there.

Upvotes: 1
