Java Regex to validate and capture ID and full name

Question

I'm working on a program that will run through a list of 20,000+ records, each holding an ID, last name, first name, and middle name. I have a working regex that pulls and groups the records that start with an ID sequence, and another that pulls and groups the records that start with an infraction (ticket) number sequence. The difference between the two is that the latter is a 12-character sequence (3 letters followed by 9 digits), as opposed to a 9-digit ID sequence. Then there is the obvious problem of validating the names: some last names have three or more words (e.g. de la Cruz), some are hyphenated (e.g. Smith-Doe), and some are just really long. The same problem appears for middle names, which are sometimes a middle initial followed by a period, sometimes just the initial (no period), and sometimes the actual middle name.

I've created two classes to model the person objects, each with 4 fields (ID/ticket number, lName, fName, mName). I want the regex to accurately group and store the 3 parts of a person's full name (as one person object, which will be stored in a Vector) so that I can later search for people who are on both the ticket list and the ID list, and store those matches in a separate list.
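
The cross-list search described above could be sketched roughly as follows. This is only an illustration under stated assumptions: `CrossMatch`, `nameKey`, and `inBoth` are hypothetical names, each record is represented as a plain `{number, last, first, middle}` array instead of the two model classes, and two people are treated as the same person when last name, first name, and middle initial agree (ignoring case).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class CrossMatch {
    /** Hypothetical normalized key: last name, first name, middle initial, lower-cased. */
    static String nameKey(String last, String first, String middle) {
        // Only the first character of the middle name is compared, so "M", "M." and "Michael" agree.
        String initial = (middle == null || middle.isEmpty()) ? "" : middle.substring(0, 1);
        return (last.trim() + "|" + first.trim() + "|" + initial).toLowerCase(Locale.ROOT);
    }

    /**
     * Each record is {number, last, first, middle}. Returns "id ticket" pairs
     * for every ticket record whose name key also appears in the ID list.
     */
    static List<String> inBoth(List<String[]> idRecords, List<String[]> ticketRecords) {
        // Index the ID list by name key so each ticket lookup is a single map access.
        Map<String, String> idByName = new HashMap<>();
        for (String[] r : idRecords) {
            idByName.put(nameKey(r[1], r[2], r[3]), r[0]);
        }
        List<String> matches = new ArrayList<>();
        for (String[] r : ticketRecords) {
            String id = idByName.get(nameKey(r[1], r[2], r[3]));
            if (id != null) {
                matches.add(id + " " + r[0]);
            }
        }
        return matches;
    }
}
```

Indexing one list in a map avoids comparing every ticket against every ID, which matters at 20,000+ records. Note that this sketch cannot tell apart two different people who share the same name and middle initial.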

My problem is how to accurately capture valid names. Here's a look at the regexes I used to pull the two groups (this was done in Python, but I'm assuming I can reuse the regexes):

'^([A-Z]{3}\d+)\s+([^\s]+([\D+])+)'  --> Ticket group
'^(\d+)\s+([^\s]+([\D+])+)'  ---> ID group

And here's a look at my ReadFile class, which is meant to open and read the contents of the source file, storing the records as People objects in people:

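import java.io.File;
import java.util.Scanner;
import java.util.Vector;

public class ReadFile {
    private Scanner myScan;

    public void openFile(){
        try{
            // Scanner object will hold source list
            // (the backslash must be doubled inside a Java string literal)
            myScan = new Scanner(new File("C:\\source.txt"));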
        }
        catch(Exception e){
            System.out.println("Could not find file.");
        }
    }

    // readFile method will iterate through and store the contents of source list into people
    public void readFile() {
        Vector<People> people = new Vector<>();
        while(myScan.hasNext()){
            People person = new People();
            person.setSbID(myScan.next());
            person.setLastName(myScan.next());
            person.setFirstName(myScan.next());
            person.setmInit(myScan.next());
            //add the person to the people list
            people.add(person);

            System.out.printf("%s %s %s %s \n", person.getID(), person.getLastName(), person.getFirstName(), person.getmInit());
        }
    }

    public void closeFile(){
        myScan.close();
    }
}

Right now the data is being passed to the person fields as tokens read one at a time from the Scanner object, but it's not being done in a smart way (.next()). The regex I used was in a Python script that parsed the data correctly; I'm just unsure how to go about implementing it in Java. Current excerpt from the Scanner loop:

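people.add(person);
String text = person.toString();
// Backslashes are doubled for the Java string literal; parentheses balanced to match the ID pattern above
String pattern = "^(\\d+)\\s+([^\\s]+([\\D+])+)";
boolean matches = Pattern.matches(pattern, text);
if (matches) { System.out.println("matches = " + person); }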
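
For comparison, here is a minimal sketch of how the two Python patterns might be applied to a raw input line in Java so that the captured groups can be read back. It is not a definitive implementation: `LineMatcher` and `classify` are hypothetical names, and the patterns are the ones from the Python script with only the backslashes doubled. `Pattern.matches` answers yes or no for the whole string, whereas a `Matcher` exposes each group through `group(n)`.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LineMatcher {
    // The two Python patterns, compiled once; backslashes are doubled for Java string literals.
    static final Pattern TICKET = Pattern.compile("^([A-Z]{3}\\d+)\\s+([^\\s]+([\\D+])+)");
    static final Pattern ID = Pattern.compile("^(\\d+)\\s+([^\\s]+([\\D+])+)");

    /** Returns {kind, number, rawName}, or null when the line matches neither pattern. */
    static String[] classify(String line) {
        Matcher m = TICKET.matcher(line);
        if (m.find()) {
            return new String[] {"ticket", m.group(1), m.group(2)};
        }
        m = ID.matcher(line);
        if (m.find()) {
            return new String[] {"id", m.group(1), m.group(2)};
        }
        return null;
    }

    public static void main(String[] args) {
        String[] result = classify("ABC097853827 Doe, Mark J");
        System.out.println(result[0] + ": " + result[1] + " / " + result[2]);
    }
}
```

Reading the file with `Scanner.nextLine()` and passing each line to a method like this keeps the whole name in one piece, instead of relying on `next()` to split it on whitespace.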

Sample data that the program should handle:

092331234 Smith, John M.
ABC097853827 Doe, Mark J
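
Based on these two sample lines, one possible way to capture the number and the three name parts in a single pass is sketched below. The assumptions are mine and not confirmed by the full data set: the comma always ends the last name (which keeps names like de la Cruz and Smith-Doe intact), the first name is a single token, the middle name is optional and a single token, and a trailing period on it is dropped. `NameParser`, `parse`, and the group names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NameParser {
    // Assumed layout: number, whitespace, "Last, First [Middle]".
    static final Pattern RECORD = Pattern.compile(
        "^(?<number>[A-Z]{3}\\d{9}|\\d{9})\\s+"   // 3 letters + 9 digits, or 9 digits
        + "(?<last>[^,]+?)\\s*,\\s*"              // everything up to the comma
        + "(?<first>\\S+)"                        // one token
        + "(?:\\s+(?<middle>\\S+?)\\.?)?\\s*$");  // optional token, trailing period dropped

    /** Returns {number, last, first, middle}, or null if the line does not fit; middle is "" when absent. */
    static String[] parse(String line) {
        Matcher m = RECORD.matcher(line);
        if (!m.matches()) {
            return null;
        }
        String middle = m.group("middle"); // null when the optional group did not take part
        return new String[] {
            m.group("number"), m.group("last"), m.group("first"), middle == null ? "" : middle
        };
    }

    public static void main(String[] args) {
        String[] samples = {"092331234 Smith, John M.", "ABC097853827 Doe, Mark J"};
        for (String line : samples) {
            System.out.println(String.join(" | ", parse(line)));
        }
    }
}
```

Lines that return null can be logged and reviewed by hand, which is a cheap way to find out which name shapes in the real file break these assumptions.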
