I'm working on a program that will run through a list of 20,000+ records of ID, last name, first name, middle name. I have a working regex that pulls the records with an ID sequence and groups them, and another that pulls the records with an infraction number sequence and groups them. The difference between the two is that the latter has a 12-character sequence (3 letters and 9 digits, as opposed to a 9-digit ID sequence). There is the obvious problem of validating the names: some last names have three or more words (e.g. de la Cruz), some are hyphenated (Smith-Doe), and some are just really long. The same problem appears for middle names, which are sometimes a middle initial followed by a period, sometimes just the middle initial (no period), and sometimes the actual middle name.
I've created two classes to model the person objects, each with 4 fields (ID/tick num, lName, fName, mName). I want the regex to accurately group and store the 3 parts of a person's full name (as one person object, which will be stored in a Vector) so I can later search for people who are in both the ticket list and the ID list, and store the matches in a separate list.
My problem is how to accurately capture valid names. Here's a look at the regexes I used to pull the two groups (this was done in Python, but I'm assuming I can reuse the regex):
'^([A-Z]{3}\d+)\s+([^\s]+([\D+])+)' --> Ticket group
'^(\d+)\s+([^\s]+([\D+])+)' ---> ID group
and here's a look at my ReadFile Class, which is meant to open and read the contents of the source file, storing the records as objects in people:
import java.io.File;
import java.io.FileNotFoundException;
import java.util.Scanner;
import java.util.Vector;

public class ReadFile {
    private Scanner myScan;

    public void openFile() {
        try {
            // Scanner object will hold source list
            myScan = new Scanner(new File("C:\\source.txt"));
        }
        catch (FileNotFoundException e) {
            System.out.println("Could not find file.");
        }
    }

    // readFile method will iterate through and store the contents of source list into people
    public void readFile() {
        Vector<People> people = new Vector<People>();
        while (myScan.hasNext()) {
            People person = new People();
            // each next() call returns one whitespace-delimited token
            person.setSbID(myScan.next());
            person.setLastName(myScan.next());
            person.setFirstName(myScan.next());
            person.setmInit(myScan.next());
            // add the person to the people list
            people.add(person);
            System.out.printf("%s %s %s %s \n", person.getID(), person.getLastName(), person.getFirstName(), person.getmInit());
        }
    }

    public void closeFile() {
        myScan.close();
    }
}
Right now the data is passed to the person fields as tokens read from the Scanner object, but it isn't done in a smart way (just successive calls to .next()). The regex I used was in a Python script that parsed the data correctly; I'm just unsure how to go about implementing it in Java. Current excerpt from the Scanner loop:
people.add(person);
String text = person.toString();
// same pattern as the Python ID regex, with backslashes doubled for a Java string literal
String pattern = "^(\\d+)\\s+([^\\s]+([\\D+])+)";
boolean matches = Pattern.matches(pattern, text);
if (matches) { System.out.println("matches = " + person); }
Sample data that the program should handle:
092331234 Smith, John M.
ABC097853827 Doe, Mark J
Upvotes: 0
Views: 1614
Reputation: 5261
Here's a regex that will match your sample data, splitting it into the four parts:
^((?:[A-Z]{3})?\d{9})\s+(.+?),\s+(\S+)\s+(.+)$
See it work on regex101.
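To use it from Java, something along these lines should work. This is only a minimal sketch: the class name RecordParser and the method parse are made up for illustration, and it returns a plain String array rather than your People object, since I don't know what that class looks like. The only change to the regex is doubling the backslashes for the Java string literal.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the name and shape are assumptions, not part of the question's code.
class RecordParser {
    // The regex from this answer, with backslashes escaped for a Java string literal.
    private static final Pattern RECORD = Pattern.compile(
            "^((?:[A-Z]{3})?\\d{9})\\s+(.+?),\\s+(\\S+)\\s+(.+)$");

    // Returns {id, lastName, firstName, middleName}, or null if the line does not match.
    static String[] parse(String line) {
        Matcher m = RECORD.matcher(line.trim());
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3), m.group(4) };
    }

    public static void main(String[] args) {
        String[] samples = { "092331234 Smith, John M.", "ABC097853827 Doe, Mark J" };
        for (String line : samples) {
            String[] parts = parse(line);
            if (parts == null) {
                System.out.println("No match: " + line);
            } else {
                // prints "092331234 | Smith | John | M." for the first sample
                System.out.println(String.join(" | ", parts));
            }
        }
    }
}
```

In your readFile loop you would read whole lines with myScan.nextLine() instead of calling next() four times, pass each line to parse, and fill the person's fields from the returned array. Lines that come back as null are the ones worth logging and inspecting by hand.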
I would be surprised if each line is as similar as you say. I agree with the comment by @adamdc78 that there's no way to determine what's part of a multi-word first name versus middle name.
My regex also makes these assumptions:
Upvotes: 1