coolDude

Reputation: 717

RegEx for Complex String

I am new to using RegEx and am trying to use it with the Java engine. An example string that I am trying to parse is the following:

name:"SFATG";affil:100;aup:1;bu:FALSE name:"SF TAC 1";affil:29.3478;aup:19;bu:FALSE name:"SF TAC 2";affil:22.2222;aup:14;bu:FALSE name:"SF TAC 3";affil:44.4444;aup:0;bu:FALSE name:"SF DISP 4";affil:82.4742;aup:0;bu:FALSE 

I would like the RegEx to extract only the values that appear after each : and before the following ;. The quotes around the name entries should not be included. In this particular case, however, I would like to keep the space that follows the bu entry, without pulling in the name of the next field. So I'd want "FALSE " (with the trailing space), not "FALSE name", for this field.

My ultimate goal for using this RegEx would be to create an array from all of the groups/data values so that the array would contain the following:

[0]: SFATG
[1]: 100
[2]: 1
[3]: FALSE 
[4]: SF TAC 1
...Etc.

I was thinking about creating groups for each value because then I would be able to easily create an array by combining the Pattern and Matcher classes, such that:

String regEx = "Some really fancy RegEx that actually works";
Pattern p = Pattern.compile(regEx);
Matcher m = p.matcher("Some really really long String that follows the outlined format");
// I'd probably want to use an Object array since my data values vary by type
// I can also create 4 different arrays (one for name, another for affil, etc.),
// Any advice on which approach to take?
Object[] dataValues = new Object[m.groupCount()];

The RegEx that I've so far been able to come up with is as follows:

name:"(\w+)";affil:(\d+);aup:(\d+);bu:(\w+\s)

However, this seems to only work on the first 4 data values and none beyond that.

Would anyone be able to help me create a RegEx for the data that I am working with? Any assistance would be greatly appreciated! I'm also open to other approaches, such as using a different data type for storing the data afterwards (other than an Object array). The key is to obtain the data values from the string shown above and store them for processing that will occur later on.

Additional question: I'd imagine there are external libraries that are a better fit for this task. Is anyone aware of a library that would work for this?

Upvotes: 3

Views: 178

Answers (2)

teppic

Reputation: 7286

While this regex is less general purpose than @Jan's answer, it does restrict matches to the fields in your data, so it will provide syntax checking:

name:"([^"]+)";affil:([\d.]+);aup:(\d+);bu:(TRUE|FALSE) ?

Regarding the approach to extracting the values, I'd create a thin wrapper object to provide type safety:

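import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RowParser {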
    private static final Pattern ROW_PATTERN = Pattern.compile("name:\"([^\"]+)\";affil:([\\d.]+);aup:(\\d+);bu:(TRUE|FALSE) ?");

    public static void main(String[] args) {
        String data = "name:\"SFATG\";affil:100;aup:1;bu:FALSE name:\"SF TAC 1\";affil:29.3478;aup:19;bu:FALSE name:\"SF TAC 2\";affil:22.2222;aup:14;bu:FALSE name:\"SF TAC 3\";affil:44.4444;aup:0;bu:FALSE name:\"SF DISP 4\";affil:82.4742;aup:0;bu:TRUE \n";
        System.out.println(parseRows(data));
    }

    public static List<Row> parseRows(String data) {
        Matcher matcher = ROW_PATTERN.matcher(data);
        List<Row> rows = new ArrayList<>();
        while (matcher.find()) {
            rows.add(new Row(matcher));
        }
        return rows;
    }

    // Wrapper object for individual data rows
    public static class Row {
        private String name;
        private double affil;
        private int aup;
        private boolean bu;

        Row(Matcher matcher) {
            this.name = matcher.group(1);
            this.affil = Double.parseDouble(matcher.group(2));
            this.aup = Integer.parseInt(matcher.group(3));
            this.bu = Boolean.parseBoolean(matcher.group(4));
        }

        public String getName() {
            return name;
        }

        public double getAffil() {
            return affil;
        }

        public int getAup() {
            return aup;
        }

        public boolean isBu() {
            return bu;
        }

        @Override
        public String toString() {
            return "name:\"" + name + '"' + ";affil:" + affil + ";aup:" + aup + ";bu:" + String.valueOf(bu).toUpperCase();
        }
    }
}

Upvotes: 1

Jan

Reputation: 43169

One regex to rule them all

\w+:(?:"([^"]+)"|(\d+)(?=;|\Z)|(\d+\.\d+)|([A-Z]+\s))

See a demo on regex101.com.


Broken down, this says:

\w+:                 # 1+ word characters, followed by :
(?:                  # a non-capturing group
    "([^"]+)"        # "(...)"
    |                # or
    (\d+)(?=;|\Z)    # only digits (no floats)
    |                # or
    (\d+\.\d+)       # floats
    |                # or
    ([A-Z]+\s)       # only UPPERCASE, followed by space
)

Here, you'll need to check which capture group was filled for each match. Note also that in a Java string literal every backslash must be doubled (i.e. \\d+ instead of \d+). Checking which group matched takes some programming logic, e.g. https://ideone.com/sbgZxY (I'm not a Java guy though).
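
For reference, here is a minimal sketch of what that group-checking logic might look like in Java. The class name FieldExtractor and the method extract are made up for illustration and are not taken from the linked demo. It assumes the input follows the format in the question, including the trailing whitespace after each bu value (without it the last bu field would not match the fourth alternative).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, named for illustration only.
public class FieldExtractor {
    // Same pattern as above, with backslashes and quotes escaped for a Java string literal.
    private static final Pattern FIELD = Pattern.compile(
            "\\w+:(?:\"([^\"]+)\"|(\\d+)(?=;|\\Z)|(\\d+\\.\\d+)|([A-Z]+\\s))");

    // Returns the values in order of appearance as one flat list.
    public static List<Object> extract(String data) {
        List<Object> values = new ArrayList<>();
        Matcher m = FIELD.matcher(data);
        while (m.find()) {
            // Exactly one of the four groups is non-null for each match.
            if (m.group(1) != null) {
                values.add(m.group(1));                  // quoted string, quotes stripped
            } else if (m.group(2) != null) {
                values.add(Integer.valueOf(m.group(2))); // whole number
            } else if (m.group(3) != null) {
                values.add(Double.valueOf(m.group(3)));  // decimal number
            } else {
                values.add(m.group(4));                  // uppercase word plus trailing whitespace
            }
        }
        return values;
    }
}
```

Calling FieldExtractor.extract(...) on the string from the question should give a flat list in the order name, affil, aup, bu for each record; call toArray() on it if an Object[] is really needed. Note that a whole-number affil such as 100 ends up as an Integer while 29.3478 ends up as a Double, so the typed wrapper in the other answer may be easier to work with downstream.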

Upvotes: 4
