Reputation: 873
I have a messy dataset that includes hundreds of manager names, some with their middle initials and some without (which inflates the number of unique manager names and any analyses run by manager name). I want to take the fullest name possible and replace the shorter names, i.e. if the manager name includes a middle initial, it will use that to replace the shorter manager names. Below is a sample dataset with fake names:
data test;
input project name $50.;
datalines;
1 Farmer,Richard
2 Farmer,Richard L
3 Farmer,Richard
4 Farmer,Richard
5 Farmer,Richard L
6 Rooney,Douglas
7 Rowe,Jamie Elaine
8 Rowe,Jamie
9 Rowe,Jamie
10 Rowe,Jamie Elaine
11 Smith,Mary-Lou
12 Needs,Mark
13 Needs,Mark H
14 Smith,Marie
;
run;
As such, I want to replace all Farmer,Richard
names with Farmer,Richard L
and likewise with the rest of the names. Is there a quick method for this? I've tried first.
and last.
methods, counting, lagging, and just can't seem to find a smooth method of cleaning. DQMATCH
seems to do what I want, but I don't have that licensed. Any insight?
Upvotes: 2
Views: 1147
Reputation: 11755
I've found SAS to be less than ideal for working with this kind of thing. Do you necessarily have to do it in SAS? Some alternatives to consider would be:
The OpenRefine tool (used to be called GoogleRefine, as it was initially developed there). It's fairly easy to get started using it, and really fantastic at many kinds of data cleaning. There are some nice introductory videos on the old documentation page. (All you'll need to do is read in your data, create a "text facet" on the name column, and then use the clustering algorithms. Hopefully this will help you get started.)
If you can work in Perl, python, or something similar, you can use the "fingerprint" concept to facilitate matching. (OpenRefine uses this as one of its base techniques, but you can use it generally.) I'm sure this approach could also be coded in SAS. Once you have a fingerprint for each of your names, you could group them when the fingerprints are sufficiently close -- where you can measure distance with something like the complev
function in SAS if you wish.
Upvotes: 3
Reputation: 63424
The basic strategy I recommend:
FIRST.
and retain
to keep that version of the name on other rows with the common name.This works for your test data I suspect, given the rule that "longest name is best", and assuming a change in last name automatically means a new person and two consecutive people never share last names. These probably aren't realistic rules for your production data, but you'll need to figure out how to implement those.
data test;
input project name $50.;
datalines;
1 Farmer,Richard
2 Farmer,Richard L
3 Farmer,Richard
4 Farmer,Richard
5 Farmer,Richard L
6 Rooney,Douglas
7 Rowe,Jamie Elaine
8 Rowe,Jamie
9 Rowe,Jamie
10 Rowe,Jamie Elaine
11 Smith,Mary-Lou
12 Needs,Mark
13 Needs,Mark H
14 Smith,Marie
;
run;
data before_groups;
set test;
lname= scan(name,1,',');
run;
data for_groups;
set before_groups;
by lname notsorted;
if first.lname then group+1;
run;
data for_master_name;
set for_groups;
by group;
name_Quality = length(name);
run;
proc sort data=for_master_name;
by group descending name_quality;
run;
data final_name;
set for_master_name;
retain name_master;
by group;
if first.group then name_master=name;
run;
Upvotes: 2