user3596998

Java Regex to validate and capture ID and full name

I'm working on a program that will run through a list of 20,000+ records, each consisting of an ID, last name, first name, and middle name. I have a working regex that pulls and groups the records with an ID sequence, and another that pulls and groups the records with an infraction (ticket) number sequence. The difference between the two is that the latter is a 12-character sequence (3 letters followed by 9 digits), as opposed to a 9-digit ID sequence. There is the obvious problem of validating the names: some last names have several parts (e.g. de la Cruz, Smith-Doe), and some are just really long. The same problem appears for middle names, which are sometimes a middle initial followed by a period, sometimes just the initial (no period), and sometimes the full middle name.

I've created two classes to model the person objects, each with 4 fields (ID/ticket number, lName, fName, mName). I want the regex to accurately group and store the 3 parts of a person's full name (as one person object, which will be stored in a Vector) so I can later search for people who appear in both the ticket list and the ID list, and store the matches in a separate list.

My problem is how to accurately capture valid names. Here's a look at the regexes I used to pull the two groups (this was done in Python, but I'm assuming I can reuse the regex):

'^([A-Z]{3}\d+)\s+([^\s]+([\D+])+)'  --> Ticket group
'^(\d+)\s+([^\s]+([\D+])+)'  ---> ID group

And here's a look at my ReadFile class, which is meant to open and read the contents of the source file, storing the records as objects in people:

import java.io.File;
import java.util.Scanner;
import java.util.Vector;

public class ReadFile {
    private Scanner myScan;

    public void openFile(){
        try{
            // Scanner object will hold source list
            myScan = new Scanner(new File("C:\\source.txt"));
        }
        catch(Exception e){
            System.out.println("Could not find file.");
        }
    }

    // readFile method will iterate through and store the contents of source list into people
    public void readFile() {
        Vector<People> people = new Vector<People>();
        while(myScan.hasNext()){
            People person = new People();
            person.setSbID(myScan.next());
            person.setLastName(myScan.next());
            person.setFirstName(myScan.next());
            person.setmInit(myScan.next());
            //add the person to the people list
            people.add(person);

            System.out.printf("%s %s %s %s \n", person.getID(), person.getLastName(), person.getFirstName(), person.getmInit());
        }
    }

    public void closeFile(){
        myScan.close();
    }
}

Right now the data is passed to the person fields one whitespace-separated token at a time as it is read from the Scanner object (.next()), which is not a smart way to do it. The regex I used was in a Python script that parsed the data correctly; I'm just unsure how to go about implementing it in Java. Current excerpt from the Scanner loop:

people.add(person);
String text = person.toString();
String pattern = "^(\\d+)\\s+([^\\s]+([\\D+])+)";
boolean matches = Pattern.matches(pattern, text);
if (matches) { System.out.println("matches = " + person); }

Sample data that the program should handle:

092331234 Smith, John M.
ABC097853827 Doe, Mark J

Upvotes: 0

Views: 1614

Answers (1)

Brian Stephens

Reputation: 5261

Here's a regex that will match your sample data, splitting it into the four parts:

^((?:[A-Z]{3})?\d{9})\s+(.+?),\s+(\S+)\s+(.+)$

See it work on regex101.
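
For the Java side, here is a minimal sketch of how the regex above could be applied with java.util.regex, one line at a time. The class name RecordParser and the parse helper are made up for illustration, and it assumes each record sits on its own line; the four captured groups could be passed to the People setters instead of being returned as an array.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original code.
public class RecordParser {
    // The regex from this answer, with backslashes doubled for a Java string literal.
    private static final Pattern RECORD =
            Pattern.compile("^((?:[A-Z]{3})?\\d{9})\\s+(.+?),\\s+(\\S+)\\s+(.+)$");

    // Returns {id, lastName, firstName, middleName}, or null if the line does not match.
    public static String[] parse(String line) {
        Matcher m = RECORD.matcher(line.trim());
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3), m.group(4) };
    }

    public static void main(String[] args) {
        String[] samples = { "092331234 Smith, John M.", "ABC097853827 Doe, Mark J" };
        for (String line : samples) {
            String[] parts = parse(line);
            System.out.println(parts == null ? "no match: " + line : String.join(" | ", parts));
        }
    }
}
```

In the ReadFile loop this would mean reading whole lines (hasNextLine()/nextLine()) rather than calling next() four times, since next() splits on whitespace and cannot keep a multi-word last name together.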

I would be surprised if each line is as similar as you say. I agree with the comment by @adamdc78 that there's no way to determine what's part of a multi-word first name versus middle name.

My regex also makes these assumptions:

  • the ID and name are the entire line
  • there's always a comma separating the last name from the other names
  • there's always a middle name
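
If the last assumption turns out not to hold for some records, one possible relaxation is to make the middle-name group optional. This is only a sketch: the modified pattern is a variation on the regex above rather than something tested against the real data, and the class and method names are hypothetical. A missing middle name is returned as an empty string here, which is an arbitrary choice.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical variant: same pattern, but the middle name may be absent.
public class OptionalMiddleParser {
    // (?:\s+(.+))? wraps the separator and the middle-name group so both can be skipped.
    private static final Pattern RECORD =
            Pattern.compile("^((?:[A-Z]{3})?\\d{9})\\s+(.+?),\\s+(\\S+)(?:\\s+(.+))?$");

    // Returns {id, lastName, firstName, middleName}; middleName is "" when missing.
    // Returns null if the line does not match at all.
    public static String[] parse(String line) {
        Matcher m = RECORD.matcher(line.trim());
        if (!m.matches()) {
            return null;
        }
        // group(4) is null when the optional part did not participate in the match.
        String middle = m.group(4) == null ? "" : m.group(4);
        return new String[] { m.group(1), m.group(2), m.group(3), middle };
    }

    public static void main(String[] args) {
        System.out.println(String.join(" | ", parse("092331234 Smith, John M.")));
        System.out.println(String.join(" | ", parse("ABC097853827 de la Cruz, Maria")));
    }
}
```

The other two assumptions still apply: a line without a comma after the last name will not match, and anything after the first name is treated as the middle name.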

Upvotes: 1
