Reputation: 339
I am making an app that is reading an excel file containing customer information and pushing this data into an SQL database. The problem is that the original designer instead of using a columns to store customer name
, customer phone
, secondary contact name
, and secondary contact phone
they put it all of it into one long sentence in one single column.
My plan is to strip out all whitespace and non alpha-numeric characters from the entry so I essentially get one long string which at its longest could be something like this
JeffSmith07621589641SarahSmith09854315741
I intended to split the names and the numbers, (and further split the names again into first/last), using a regular expression. I'd been trying regex.split like so
String[] splitArray = Regex.Split("JeffSmith07621589641SarahSmith09854315741",
@"(?<=[a-zA-Z])(?=\d)");
I hoped to get 4 elements but instead my results are coming out like this
stringArray[0] = JeffSmith
stringArray[1] = 07621589641SarahSmith
stringArray[3] = 09854315741
As you can see I'm not getting a split from the first phone number, second name.
What would be the best way to extract the data?
If its a regex, what is needed to add to the regular expression in order to achieve what I'm looking for?
Regex is slow and I have around 4000 records to process in the Excel file.
Upvotes: 2
Views: 974
Reputation: 31721
Or you can put them into entities such as this:
var text = "JeffSmith07621589641SarahSmith09854315741";
var pattern = @"(?<Name>[^\d]+)(?<Number>\d+)";
var entities =
Regex.Matches(text, pattern)
.OfType<Match>()
.Select(mt => new
{
Name = mt.Groups["Name"].Value,
Phone = mt.Groups["Number"].Value,
});
Result
Or get first and last name if they follow the pattern of Uppercase followed by lower case:
var pattern = @"(?<First>[A-Z][a-z]+)(?<Last>[^\d]+)(?<Number>\d+)";
var entities =
Regex.Matches(text, pattern)
.OfType<Match>()
.Select(mt => new
{
NameFirst = mt.Groups["First"].Value,
NameLast = mt.Groups["Last"].Value,
Phone = mt.Groups["Number"].Value,
});
Update to Organize Excel First
discovered that some of the name entries are entirely in lowercase
I recommend that you alter the text in Excel before creating an Erroneous records list. Do these steps
=(CODE(LEFT(B3))>90)+0
then sort by that. It will allow you to sort alphabetically upper to lower case or simply filter out the upper case and adjust accordingly. Regex is slow and I have around 4000 records to process in the Excel file.
Regex is slow when users design patterns which create lots of backtracking (read user error) during processing.
Simply avoiding .*
in favor of .+
in most patterns can alleviate most of the time it takes for regex to process the data by avoiding a good chunk of backtracking.
Frankly reading from Excel will be the largest bottleneck. I would surmise if a test is run on 4000K records, regex processing would only add between 5 and 15 seconds with Excel and writing to the DB taking the lion share of time. (IMHO)
Upvotes: 2
Reputation: 174844
Ya, just do the same for another possibility. ie, match also the boundary which exists between a digit and letter. Currently your regex only matches the boundary which exists between letter and a digit.
String[] splitArray = Regex.Split("JeffSmith07621589641SarahSmith09854315741", @"(?<=[a-zA-Z])(?=\d)|(?<=\d)(?=[a-zA-Z])");
Upvotes: 3