kstats9pt3
kstats9pt3

Reputation: 873

SAS Data Cleaning for Messy Name Variable with Middle Initial

I have a messy dataset that includes hundreds of manager names, some with their middle initials and some without (which inflates the number of unique manager names and any analyses run by manager name). I want to take the fullest name possible and replace the shorter names, i.e. if the manager name includes a middle initial, it will use that to replace the shorter manager names. Below is a sample dataset with fake names:

data test;
    input project name $50.;
    datalines;
    1   Farmer,Richard
    2   Farmer,Richard L
    3   Farmer,Richard
    4   Farmer,Richard
    5   Farmer,Richard L
    6   Rooney,Douglas
    7   Rowe,Jamie Elaine
    8   Rowe,Jamie
    9   Rowe,Jamie
    10  Rowe,Jamie Elaine
    11  Smith,Mary-Lou
    12  Needs,Mark
    13  Needs,Mark H
    14  Smith,Marie
    ;
run;

As such, I want to replace all Farmer,Richard names with Farmer,Richard L and likewise with the rest of the names. Is there a quick method for this? I've tried first. and last. methods, counting, lagging, and just can't seem to find a smooth method of cleaning. DQMATCH seems to do what I want, but I don't have that licensed. Any insight?

Upvotes: 2

Views: 1147

Answers (2)

itzy
itzy

Reputation: 11755

I've found SAS to be less than ideal for working with this kind of thing. Do you necessarily have to do it in SAS? Some alternatives to consider would be:

  • The OpenRefine tool (used to be called GoogleRefine, as it was initially developed there). It's fairly easy to get started using it, and really fantastic at many kinds of data cleaning. There are some nice introductory videos on the old documentation page. (All you'll need to do is read in your data, create a "text facet" on the name column, and then use the clustering algorithms. Hopefully this will help you get started.)

  • If you can work in Perl, python, or something similar, you can use the "fingerprint" concept to facilitate matching. (OpenRefine uses this as one of its base techniques, but you can use it generally.) I'm sure this approach could also be coded in SAS. Once you have a fingerprint for each of your names, you could group them when the fingerprints are sufficiently close -- where you can measure distance with something like the complev function in SAS if you wish.

Upvotes: 3

Joe
Joe

Reputation: 63424

The basic strategy I recommend:

  • Figure out a method to identify which name you want to be the "master" name.
  • Figure out a method for identifying names that are common. It wasn't clear to me if Mary-Lou Smith and Marie Smith were intended to be identical people or not; in your case this may just be having consecutive records grouped in some fashion.
  • Use your first method (for identifying master names), give them some value you can sort by where that value is placed first in the sort order.
  • Then sort, and use FIRST. and retain to keep that version of the name on other rows with the common name.

This works for your test data I suspect, given the rule that "longest name is best", and assuming a change in last name automatically means a new person and two consecutive people never share last names. These probably aren't realistic rules for your production data, but you'll need to figure out how to implement those.

data test;
    input project name $50.;
    datalines;
    1   Farmer,Richard
    2   Farmer,Richard L
    3   Farmer,Richard
    4   Farmer,Richard
    5   Farmer,Richard L
    6   Rooney,Douglas
    7   Rowe,Jamie Elaine
    8   Rowe,Jamie
    9   Rowe,Jamie
    10  Rowe,Jamie Elaine
    11  Smith,Mary-Lou
    12  Needs,Mark
    13  Needs,Mark H
    14  Smith,Marie
    ;
run;

data before_groups;
    set test;
    lname= scan(name,1,',');
run;

data for_groups;
    set before_groups;
    by lname notsorted;
    if first.lname then group+1;
run;

data for_master_name;
    set for_groups;
    by group;
    name_Quality = length(name);
run;

proc sort data=for_master_name;
    by group descending name_quality;
run;

data final_name;
    set for_master_name;
    retain name_master;
    by group;
    if first.group then name_master=name;
run;

Upvotes: 2

Related Questions