Reputation: 349
I am working with SQL Server 2012 and Visual Studio 2010 with Excel 2007.
I am building a package that will pull information from a .CSV
file and import it into a table in a DB that we are creating for an HRIS project.
The .CSV
file has the following fields starting in B9 and running through column J (row length will vary)
| PersNo | IDNo | Name | PArea | OrgUnit | OrgName | UserID | EntryDate |
The issue I have is that I need to take the Name
column and split it into a FirstName
and LastName
columns. I have scoured the internet and found a number of forums that detail methods that use Derived Columns, Conditional Split, Script Component and MultiCast. None of them were helpful or fit my needs.
The issue is complicated by the name arrangement. I have three different name types that can be displayed.
John Smith
John A Smith
John Smith III
The last two they will need to be split differently.
If there is a middle initial they want it split as such.
| FirstName | LastName |
------------------------
| John | Smith A |
i.e. Smith and the Middle Initial will be in the LastName
column
If they have a suffix then they will need to be split as such.
| FirstName | LastName |
-------------------------
| John | Smith III |
i.e. Smith and the suffix will be in the LastName
column
I need to know if there is a way that I can split this out in one package.
Eventually this will be automated to run daily so I will also have to utilize a lookup I assume to filter out any duplicates and enter only new data.
I appreciate anything information that anyone can provide.
Upvotes: 2
Views: 1645
Reputation: 9943
Parseing names is surprisingly difficult due to the numerous variations even in just a single culture, as you have already identified. If you can go back to the source data from where the CSV generated and extract this information then that would be preferable but...
One way to handle this is to use several Regular Expressions in a Script Component. Each regex would be responsible for picking out a part of the name which you will later be able to concatenate as you like.
Try something like this, you may have to tweak the regex's, they're definately not my strongest point, and as time goes on you may find cases that don't fit, but it gives you a start.
public override void Input0_ProcessInputRow(Input0Buffer Row)
{
const string pFirstName = @"^[A-Z]([-']?[a-z]+)*";
const string pSuffix = @"((Jr|Sr|I|V|X)( ?))*";
const string pInitial = @"(?<=\s)[A-Z](?=\s)";
const string pLastName = @"(?!(?:Jr|Sr|I|V|X|^))([A-Z][-'\s]?[a-z]+)";
string fullName = Row.Name.ToString();
string firstName = Regex.Match(fullName, pFirstName).Value;
string suffix = Regex.Match(fullName, pSuffix).Value;
string initial = Regex.Match(fullName, pInitial).Value;
string lastName = Regex.Match(fullName, pLastName).Value;
if (!string.IsNullOrEmpty(initial))
lastName += " " + initial;
if (!string.IsNullOrEmpty(suffix))
lastName += " " + suffix;
Row.FirstName = firstName;
Row.LastName = lastName;
}
Edit: I have made a simple project demoing the above and showing all three forms of the script component, you can download it here
Upvotes: 1