Tara

Reputation: 65

Regex for custom parsing

Regex isn't my strongest point. Let's say I need a custom parser that strips a string of any letters and of any decimal points after the first one.

For example, if the input string is "--1-2.3-gf5.47", the parser should return "-12.3547". I could only come up with variations of this:

string.replaceAll("[^(\\-?)(\\.?)(\\d+)]", "")

which removes the letters but retains everything else. Any pointers?

More examples: Input: -34.le.78-90 Output: -34.7890

Input: df56hfp.78 Output: 56.78

Some rules:

Upvotes: 2

Views: 279

Answers (2)

isosceleswheel

Reputation: 1546

In terms of regex, the second, third, etc. decimal points seem tough to remove. However, this one should remove the additional dashes and the letters: (?<=.)-|[a-zA-Z]. (Hopefully the syntax is the same in Java; this is a Python regex, but my understanding is that regex syntax is relatively uniform across languages.)
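For what it's worth, the pattern should carry over to Java unchanged, since java.util.regex supports this kind of lookbehind. Here is a minimal sketch (the class and method names are made up for illustration) showing that it takes care of the dashes and letters but still leaves the extra decimal points in place:

```java
class DashAndAlphaStrip {
    // Removes every dash that has a character before it, plus all ASCII letters.
    // Extra decimal points are NOT handled by this pattern.
    static String stripDashesAndLetters(String input) {
        return input.replaceAll("(?<=.)-|[a-zA-Z]", "");
    }

    public static void main(String[] args) {
        System.out.println(stripDashesAndLetters("--1-2.3-gf5.47")); // -12.35.47
        System.out.println(stripDashesAndLetters("-34.le.78-90"));   // -34..7890
        System.out.println(stripDashesAndLetters("df56hfp.78"));     // 56.78
    }
}
```

So the regex alone gets two of the three rules; the surplus dots still need a second step.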

That being said, it seems like you could just run a pretty short "finite state machine"-type piece of code to scan the string and rebuild the reduced string yourself like this:

a = "--1-2.3-gf5.47"
new_a = ""
dash = False  # becomes True once a dash or a digit has been kept
dot = False   # becomes True once the first dot has been kept
nums = '0123456789'
for char in a:
    if char in nums:
        new_a = new_a + char  # keep every digit
        dash = True  # a digit has been seen, so no dashes are accepted from now on
    elif char == '-' and not dash:
        new_a = new_a + char  # keep a dash only if no digit or dash has been kept yet
        dash = True  # activate the flag
    elif char == '.' and not dot:
        new_a = new_a + char  # take the first dot
        dot = True  # put up the dot flag
print(new_a)  # -12.3547

(Again, sorry for the syntax; I think in Java you need some curly brackets around the statements, versus Python's indentation-only style.)
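A rough Java port of the Python loop might look like the sketch below. The class name NumberScanner, the method name scan, and the flag names are illustrative choices, not anything from the question; the logic is meant to mirror the Python version line for line.

```java
class NumberScanner {
    // Scans the input once, keeping digits, at most one dash (only before
    // any digit has been kept), and only the first decimal point.
    static String scan(String input) {
        StringBuilder out = new StringBuilder();
        boolean dashClosed = false; // true once a dash or a digit has been kept
        boolean dotSeen = false;    // true once the first dot has been kept
        for (char c : input.toCharArray()) {
            if (c >= '0' && c <= '9') {
                out.append(c);
                dashClosed = true;  // no dashes accepted after a digit
            } else if (c == '-' && !dashClosed) {
                out.append(c);
                dashClosed = true;
            } else if (c == '.' && !dotSeen) {
                out.append(c);
                dotSeen = true;
            }
            // everything else (letters, extra dashes, extra dots) is dropped
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(scan("--1-2.3-gf5.47")); // -12.3547
        System.out.println(scan("-34.le.78-90"));   // -34.7890
        System.out.println(scan("df56hfp.78"));     // 56.78
    }
}
```

One thing to be aware of: like the Python version, this keeps a dash that comes after letters but before the first digit, so "ab-5" gives "-5". If only a dash at the very start of the string should count, the dash check would need to look at the position as well.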

Upvotes: 0

matt

Reputation: 2449

Just tested this on ideone and it seemed to work. The comments should explain the code well enough. You can copy/paste this into Ideone.com and test it if you'd like.

It might be possible to write a single regex pattern for it, but you're probably better off implementing something simpler/more readable like below.

The three examples you gave print out:

--1-2.3-gf5.47   ->   -12.3547
-34.le.78-90     ->   -34.7890
df56hfp.78       ->    56.78

import java.util.*;
import java.lang.*;
import java.io.*;

/* Name of the class has to be "Main" only if the class is public. */
class Ideone
{
    public static void main (String[] args) throws java.lang.Exception
    {
        System.out.println(strip_and_parse("--1-2.3-gf5.47"));
        System.out.println(strip_and_parse("-34.le.78-90"));
        System.out.println(strip_and_parse("df56hfp.78"));
    }

    public static String strip_and_parse(String input)
    {
        //remove anything not a period or digit (including hyphens) for output string
        String output = input.replaceAll("[^\\.\\d]", "");

        //add a hyphen to the beginning of 'output' if the original string started with one
        if (input.startsWith("-"))
        {
            output = "-" + output;
        }

        //if the string contains a decimal point, keep the first one and remove all the
        //others by splitting the output string just after that point and stripping every
        //non-digit from the second half
        int firstDot = output.indexOf(".");
        if (firstDot != -1)
        {
            output = output.substring(0, firstDot + 1)
                   + output.substring(firstDot + 1).replaceAll("[^\\d]", "");
        }

        return output;
    }
}

Upvotes: 1
