krankit
krankit

Reputation: 62

Is there a faster way to parse a string for valid integers in Java?

My application is expecting json requests containing a (possible multi-dimensional) unsorted array with only integers and possible null values. Something like [6, 2, [4, 3],[[[5], nil], 1]]

Since I can't parse the invalid json, I've had to resort to using a regex to do the dirty work, and it is super slow.

The test case above for example takes about 1.xx seconds to complete, while a flat array with 10000 elements takes less than 1 second

Currently I'm getting the request body as a string and then applying the regex.

static ArrayList<Integer> getIntegers(String requestData) {
    // Apply a regex to the request body
    final String regularExpression = "([^\\d])+";
    // to get all the nested arrays
    Pattern pattern = Pattern.compile(regularExpression);
    String[] results = pattern.split(requestData);
    ArrayList<Integer> numbers = new ArrayList<>();
    // loop over the results and add to numbers array
    for (String result : results) {
        try {
            numbers.add(Integer.valueOf(result));
        } catch (NumberFormatException e) {
            // Catch and skip any non integers
        }

    }
    return numbers;
}

}

Is there anyway I can speed this up or is there maybe an alternative approach with better performance? If I need to process a multi-dimensional array with 20000 elements it will be way too slow.

Upvotes: 2

Views: 1302

Answers (5)

Lino
Lino

Reputation: 19935

I've tinkered around a bit and created following class:

class JsonNumberParser {
    private final String json;
    private final int length;
    private final List<Integer> result;
    private final char[] buffer = new char[64];
    private int bufferIndex = 0;

    public JsonNumberParser(String json) {
        this.json = json;
        length = json.length();
        result = new ArrayList<>(length);
    }

    public List<Integer> parse() {
        char c;
        for (int i = 0; i < length; i++) {
            c = json.charAt(i);
            // if we encounter a comma and the buffer contains data
            if (c == ',' && bufferIndex > 0) {
                // then we add the new number
                addBuffer();
                // and reset the buffer
                while (bufferIndex > 0) {
                    buffer[--bufferIndex] = '\0';
                }
            } else if (c == '-' || (c >= '0' && c <= '9')) {
                buffer[bufferIndex++] = c;
            }
        }
        // add the last possible number, if there was any
        if (bufferIndex > 0) {
            addBuffer();
        }

        // return the result
        return result;
    }

    private void addBuffer() {
        result.add(Integer.valueOf(new String(buffer, 0, bufferIndex)));
    }
}

Of course you could put all of that into a single method, but then you'd get some code duplication regarding the adding of Integers.

The way this parser works is, that it uses a buffer, to buffer digits, until we encounter a comma. That way we can have big numbers (up to 64 digits in this implementation) in the json.

You can use this like shown in the following example:

List<Integer> integers = new JsonNumberParser(jsonRequest).parse();

Regarding the performance, I expect this to be much faster than using a Regex. But I sadly don't have a benchmark setup at hand


Keep in mind, that this is not a validator, so a json string: [[,,,]}] would just produce an empty List


(Maybe) Improvements: I've thought and searched a bit more. Here are some improvements which could make the performance better:

1. One could just reset the buffer by assigning it with a new int[64], which would produce more garbage, but in the end may be faster.

2. The parsing of the number could be improved by using the answer suggested here. Which uses just plain old math and no creation of strings and parsing of integers.

Upvotes: 3

Holger
Holger

Reputation: 298449

This answer does already point into the right direction. The first important step is to move the expensive Pattern.compile operation out of the method, as Pattern instance can be reused.

Further, iterating over the number matches saves you from the array creation of split. Now, you may also skip the sub-String creation:

static final Pattern NUMBER = Pattern.compile("\\d+");
static ArrayList<Integer> getIntegers(String requestData) {
    ArrayList<Integer> numbers = new ArrayList<>();
    Matcher m = NUMBER.matcher(requestData);
    while(m.find()) numbers.add(Integer.parseInt(requestData, m.start(), m.end(), 10));
    return numbers;
}

Integer.parseInt(CharSequence s, int beginIndex, int endIndex, int radix) has been added in Java 9. If you are operating at an older version, you may create your own variant of it. For simplification, now only supporting a radix of 10:

static final Pattern NUMBER = Pattern.compile("-?\\d+");
static ArrayList<Integer> getIntegers(String requestData) {
    ArrayList<Integer> numbers = new ArrayList<>();
    Matcher m = NUMBER.matcher(requestData);
    while(m.find()) numbers.add(parseInt(requestData, m.start(), m.end()));
    return numbers;
}

static int parseInt(CharSequence cs, int start, int end) {
    int pos = start;
    if(pos >= end) throw format(cs, start, end);
    boolean negative = cs.charAt(pos) == '-';
    if((negative || cs.charAt(pos) == '+') && ++pos==end)
        throw format(cs, start, end);
    int value = 0;
    for(; pos < end; pos++) {
        int next = cs.charAt(pos) - '0';
        if(next < 0 || next > 9) throw format(cs, start, end);
        if(value < Integer.MIN_VALUE/10) throw size(cs, start, pos, end);
        value = value * 10 - next;
    }
    if(value > 0 || !negative && value == Integer.MIN_VALUE)
        throw size(cs, start, pos, end);
    return negative? value: -value;
}
private static RuntimeException format(CharSequence cs, int start, int end) {
    return start > end? new IndexOutOfBoundsException(end+" < "+start):
        new NumberFormatException(start == end?
            "empty string": cs.subSequence(start, end).toString());
}
private static RuntimeException size(CharSequence cs, int start, int pos, int end) {
    for(; pos < end; pos++) 
        if(cs.charAt(pos) < '0' || cs.charAt(pos) > '9') return format(cs, start, end);
    return new NumberFormatException(cs.subSequence(start, end)+" outside the int range");
}

Upvotes: 3

CoronA
CoronA

Reputation: 8095

You can gain better performance by parsing positive patterns (e.g. \d+) instead of negative ones ([^\d]+).

private static final Pattern NUMBER = Pattern.compile("\\d+");

List<Integer> extractNumbersRegex(String str) throws IOException {
    Matcher m = NUMBER.matcher(str);
    ArrayList<Integer> numbers = new ArrayList<>();
    while (m.find()) {
        numbers.add(Integer.parseInt(m.group()));
    }
    return numbers;
}

This would be ok for extracting from strings, yet for large data one may switch to more efficient that do not depend on regular expressions but on directly matching chars:

List<Integer> extractNumbersHandcoded(String str) throws IOException {
    ArrayList<Integer> numbers = new ArrayList<>();
    int start = 0;
    while (start < str.length()) {
        if (Character.isDigit(str.charAt(start))) {
            break;
        } 
        start++;
    }
    int bufferedInt = 0;
    for (int i = start; i < str.length(); i++) {
        char c = str.charAt(i);
        if (Character.isDigit(c)) {
            bufferedInt = bufferedInt * 10 + (c - '0');
        } else {
            numbers.add(bufferedInt);
            bufferedInt = 0;
        }
    }
    return numbers;
}

If your data is as large that it comes as stream you may consider a solution with a Streamtokenizer:

List<Integer> extractNumbersStreamTokenizer(String str) throws IOException {
    StreamTokenizer s = new StreamTokenizer(new StringReader(str));
    ArrayList<Integer> numbers = new ArrayList<>();
    int token;
    while ((token = s.nextToken()) != StreamTokenizer.TT_EOF) {
        if (token == StreamTokenizer.TT_NUMBER) {
            numbers.add((int) s.nval);
        }
    }
    return numbers;
}

All solutions assume that the data contains only integer literals (not float literals).

Upvotes: 0

YetAnotherBot
YetAnotherBot

Reputation: 2086

How about using a stack?

We can upgrade balanced braces problem.

While iterating the string, if the character is notBracket(), then it should be a number. Needless to say, you ignore all commas. Simultaneously, it will also verify the array structure.

This has an amortized complexity of O(n).

Upvotes: 0

SEY_91
SEY_91

Reputation: 1687

If the performance is the issue in your case I don't think stream API will be a good solution.

static ArrayList<Integer> getIntegers(String requestData) {
            char[] charArray = requestData.toCharArray();
             ArrayList<Integer> numbers = new ArrayList<>();
            for(char c : charArray) {

                if(Character.isDigit(c)) {
                    numbers.add(Integer.valueOf(c) - 48);
                }
            }
            return numbers;
        }

Upvotes: 0

Related Questions