Ian Murray
Ian Murray

Reputation: 339

Splitting a long string into numeric and alpha components with Regex Split

I am making an app that is reading an excel file containing customer information and pushing this data into an SQL database. The problem is that the original designer instead of using a columns to store customer name, customer phone, secondary contact name, and secondary contact phone they put it all of it into one long sentence in one single column.

My plan is to strip out all whitespace and non alpha-numeric characters from the entry so I essentially get one long string which at its longest could be something like this

JeffSmith07621589641SarahSmith09854315741

I intended to split the names and the numbers, (and further split the names again into first/last), using a regular expression. I'd been trying regex.split like so

String[] splitArray = Regex.Split("JeffSmith07621589641SarahSmith09854315741", 
                                  @"(?<=[a-zA-Z])(?=\d)");

I hoped to get 4 elements but instead my results are coming out like this

stringArray[0] = JeffSmith
stringArray[1] = 07621589641SarahSmith
stringArray[3] = 09854315741

As you can see I'm not getting a split from the first phone number, second name.

What would be the best way to extract the data?

If its a regex, what is needed to add to the regular expression in order to achieve what I'm looking for?

Regex is slow and I have around 4000 records to process in the Excel file.

Upvotes: 2

Views: 974

Answers (2)

ΩmegaMan
ΩmegaMan

Reputation: 31721

Or you can put them into entities such as this:

var text = "JeffSmith07621589641SarahSmith09854315741";

var pattern = @"(?<Name>[^\d]+)(?<Number>\d+)";

var entities =
         Regex.Matches(text, pattern)
              .OfType<Match>()
              .Select(mt => new
                {
                    Name = mt.Groups["Name"].Value,
                    Phone = mt.Groups["Number"].Value,
                });

Result

enter image description here


Or get first and last name if they follow the pattern of Uppercase followed by lower case:

var pattern = @"(?<First>[A-Z][a-z]+)(?<Last>[^\d]+)(?<Number>\d+)";

var entities =
         Regex.Matches(text, pattern)
              .OfType<Match>()
               .Select(mt => new
                {
                    NameFirst = mt.Groups["First"].Value,
                    NameLast = mt.Groups["Last"].Value,
                    Phone = mt.Groups["Number"].Value,
                });

enter image description here

Update to Organize Excel First

discovered that some of the name entries are entirely in lowercase

I recommend that you alter the text in Excel before creating an Erroneous records list. Do these steps

  1. Put in a column with this formula =(CODE(LEFT(B3))>90)+0 then sort by that. It will allow you to sort alphabetically upper to lower case or simply filter out the upper case and adjust accordingly.

enter image description here


Regex is slow and I have around 4000 records to process in the Excel file.

Regex is slow when users design patterns which create lots of backtracking (read user error) during processing.

Simply avoiding .* in favor of .+ in most patterns can alleviate most of the time it takes for regex to process the data by avoiding a good chunk of backtracking.

Frankly reading from Excel will be the largest bottleneck. I would surmise if a test is run on 4000K records, regex processing would only add between 5 and 15 seconds with Excel and writing to the DB taking the lion share of time. (IMHO)

Upvotes: 2

Avinash Raj
Avinash Raj

Reputation: 174844

Ya, just do the same for another possibility. ie, match also the boundary which exists between a digit and letter. Currently your regex only matches the boundary which exists between letter and a digit.

String[] splitArray = Regex.Split("JeffSmith07621589641SarahSmith09854315741", @"(?<=[a-zA-Z])(?=\d)|(?<=\d)(?=[a-zA-Z])");

DEMO

Upvotes: 3

Related Questions