Knight

Reputation: 223

Extract stuff from text using Regex and Java

I have some text like this:

 //(10,0,'Computer_accessibility','',''),(13,0,'History_of_Afghanistan','',''),(14,0,'Geography_of_Afghanistan','','')

and I wrote a pattern:

public final static Pattern r_english = Pattern.compile("\\((.*?),(.*?),(.*?),(.*?),(.*?)\\)");

This works well in Java to extract m.group(1) (e.g. 13) and m.group(3) (e.g. History_of_Afghanistan), where m is a Matcher. However, it breaks on text like the following, because Washington,_D.C. (i.e. m.group(3)) contains a comma:

(8543,0,'Washington,_D.C.','',''),(8546,0,'Extermination_camp','','')

Can someone help me modify the regex so that it extracts Washington,_D.C. correctly? Thanks.
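
To make the failure concrete, here is a minimal sketch. The class name CommaSplitDemo and the helper thirdGroup are made up for illustration; only the pattern and the sample rows come from the text above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; only the pattern and sample rows come from the question.
class CommaSplitDemo {

    // The original pattern: five lazy groups separated by commas.
    static final Pattern R_ENGLISH =
            Pattern.compile("\\((.*?),(.*?),(.*?),(.*?),(.*?)\\)");

    // Returns group(3) of the first match, or null if nothing matches.
    static String thirdGroup(String text) {
        Matcher m = R_ENGLISH.matcher(text);
        return m.find() ? m.group(3) : null;
    }

    public static void main(String[] args) {
        // No comma inside the title: the whole quoted field is captured.
        System.out.println(thirdGroup("(13,0,'History_of_Afghanistan','','')"));
        // Comma inside the title: the lazy group stops at the first comma.
        System.out.println(thirdGroup("(8543,0,'Washington,_D.C.','','')"));
    }
}
```

With this pattern the captured field includes the surrounding quotes, and for the second row group(3) is cut off at 'Washington because the lazy group ends at the first comma it meets.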

Upvotes: 2

Views: 123

Answers (3)

Eder

Reputation: 1884

You need to change your regular expression so that it fits all the matches you want to retrieve, e.g.:

/\((.*?),(.*?),'(.*?)','(.*?)','(.*?)'\)/g

Working Example @ regex101

You need to translate/escape the above regular expression into a Java-compatible one, e.g.:

public static String REGEX_PATTERN = "\\((.*?),(.*?),'(.*?)','(.*?)','(.*?)'\\)";

Then iterate through all the matches to mimic the /g modifier, e.g.:

while (matcher.find()) {
}

Java Working Example:

package SO40002225;

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {

    public static String INPUT;
    public static String REGEX_PATTERN;

    static {
        INPUT = "(8543,0,'Washington,_D.C.','',''),(8546,0,'Extermination_camp','',''),(8543,0,'Washington,_D.C.','',''),(8546,0,'Extermination_camp','','')";
        REGEX_PATTERN = "\\((.*?),(.*?),'(.*?)','(.*?)','(.*?)'\\)";
    }


    public static void main(String[] args) {
        String text = INPUT;

        Pattern pattern = Pattern.compile(REGEX_PATTERN);
        Matcher matcher = pattern.matcher(text);

        while (matcher.find()) {
            String mg1 = matcher.group(1);
            String mg2 = matcher.group(2);
            String mg3 = matcher.group(3);
            String mg4 = matcher.group(4);
            String mg5 = matcher.group(5);

            System.out.println("Matching group #1: " + mg1);
            System.out.println("Matching group #2: " + mg2);
            System.out.println("Matching group #3: " + mg3);
            System.out.println("Matching group #4: " + mg4);
            System.out.println("Matching group #5: " + mg5);
        }

    }

}

Update #1

Removed the escaping of the commas within the regular expression. As Pshemo pointed out, , is not a metacharacter; it only has a special meaning inside a bounded repetition quantifier: {min,max}
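
A small sketch of that point, assuming nothing beyond java.util.regex (the class and method names are invented for this illustration): a bare comma and an escaped comma behave the same, and the comma only acts as syntax inside braces.

```java
import java.util.regex.Pattern;

// Hypothetical demo class showing that ',' needs no escaping in a Java regex.
class CommaEscapeDemo {

    // Plain comma in the pattern.
    static boolean matchesPlain(String s) {
        return Pattern.matches("a,b", s);
    }

    // Escaped comma: legal (a backslash before a non-alphabetic character
    // is allowed) but redundant.
    static boolean matchesEscaped(String s) {
        return Pattern.matches("a\\,b", s);
    }

    // Here the comma is part of the {min,max} quantifier, not a literal.
    static boolean matchesBounded(String s) {
        return Pattern.matches("a{1,2}", s);
    }

    public static void main(String[] args) {
        System.out.println(matchesPlain("a,b"));
        System.out.println(matchesEscaped("a,b"));
        System.out.println(matchesBounded("aa"));
    }
}
```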

Upvotes: 1

f1sh

Reputation: 11943

Change your third capture group to capture everything until a closing ' is reached. That allows every character (including your comma) to be captured.
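
A minimal sketch of that idea, assuming titles never contain a quote themselves. The class QuotedFieldDemo and the method titles are hypothetical names; the third group is written as '([^']*)' so it runs up to the closing quote and the captured text excludes the quotes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; assumes the quoted title contains no ' characters.
class QuotedFieldDemo {

    // Group 3 captures everything between the quotes, commas included.
    static final Pattern ROW =
            Pattern.compile("\\((.*?),(.*?),'([^']*)',(.*?),(.*?)\\)");

    // Collects group(3) of every row found in the text.
    static List<String> titles(String text) {
        List<String> result = new ArrayList<>();
        Matcher m = ROW.matcher(text);
        while (m.find()) {
            result.add(m.group(3));
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "(8543,0,'Washington,_D.C.','',''),(8546,0,'Extermination_camp','','')";
        for (String title : titles(text)) {
            System.out.println(title);
        }
    }
}
```

For the sample rows this yields Washington,_D.C. and Extermination_camp.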

UPDATE: to allow escaped quotes (\') as well, the regex looks like this. Note that a literal backslash in the regex has to be written as \\\\ in a Java string. Credit goes to Pshemo; see the comments.

public final static Pattern r_english = Pattern.compile("\\((.*?),(.*?),('(?:[^']|\\\\')*'),(.*?),(.*?)\\)");
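
For comparison, a sketch of a common variant for backslash-escaped quotes. It is not the exact pattern above: it excludes the backslash from the character class and consumes every backslash together with the character after it, which avoids relying on backtracking. The class name, the method name and the sample row with \' are assumptions made for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; assumes quotes inside a title are escaped as \'.
class EscapedQuoteDemo {

    // Inside the quotes: either a character that is neither ' nor \,
    // or a backslash followed by any single character.
    static final Pattern ROW = Pattern.compile(
            "\\((.*?),(.*?),'((?:[^'\\\\]|\\\\.)*)',(.*?),(.*?)\\)");

    // Collects group(3) of every row, escapes left as they appear in the text.
    static List<String> titles(String text) {
        List<String> result = new ArrayList<>();
        Matcher m = ROW.matcher(text);
        while (m.find()) {
            result.add(m.group(3));
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up sample row containing escaped quotes.
        String text = "(1,0,'Rock_\\'n\\'_roll','',''),(8543,0,'Washington,_D.C.','','')";
        for (String title : titles(text)) {
            System.out.println(title);
        }
    }
}
```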

Upvotes: 3

Evgeni

Reputation: 11

You should make your regex more specific to your case. For example:

\((.*?),(.*?),('.*?'),('.*?'),('.*?')\)

I anchored each text field on its single quotes ('), so this solution also tolerates parentheses inside groups 3-5.

Regards

Upvotes: 1
