kPalladyn

Reputation: 13

Regex expression for multiple patterns in 1 line

I am scraping a log from which I need three elements. An added difficulty is that I parse the log via readLine() in my Java program, i.e. one line at a time. (If it is possible to read multiple lines when parsing, let me know :) ) NOTE: I have no control over the log output format.

There are two possible layouts for what I must extract. In the nice case, the log gives the following:

NICE FORMAT

.text.rank     0x0000000000400b8f      0x351 is_x86.o

where I must grab .text.rank, 0x0000000000400b8f, and 0x351.

Now the not-so-nice case: if the name is too long, it bumps everything else to the next line, as shown below. The only thing after the first element is then one blank space followed by a newline (\n), which readLine() strips anyway.

EVIL FORMAT: note that each line is in a separate ArrayList entry.

.text.__sfmoreglue 
            0x0000000000401d00       0x55 /mnt/drv2homelibc_popcorn.a(lib_a-findfp.o)

Therefore what the regex actually sees is:

.text.__sfmoreglue

CORNER CASE FORMAT that also occurs within the log but I DO NOT want

 *(.text.unlikely)

Finally, below is the Pattern p I currently use for the first line; pline2 is the pattern used on the next line when group 2 of the first line is empty.

UPDATE: The pattern below works for the NICE FORMAT and the EVIL FORMAT, but now pattern pline2 has no matches, even though it is correct on regex101.com. Link: https://regex101.com/r/vS7vZ3/9

UPDATE2: I fixed it. I forgot to call m2.find() after creating the matcher for the second line from Pattern pline2. Corrected code is below.

Pattern p = Pattern.compile("^[ \\s](\\.[tex]*\\.[\\._\\-\\@a-zA-Z0-9]*)\\s*([x0-9a-f]*)[ \\s]*([x0-9a-f]*).*");

Pattern pline2 = Pattern.compile("^\\s*([x0-9a-f]*)[ \\s]*([x0-9a-f]*)\\s*[\\w\\(\\)\\.\\-]*");

To give a little background: I first match the name .text.whatever to m.group(1), followed by the address 0x000012345 to m.group(2), and finally the size 0xa48 to m.group(3). This all assumes the log is in the NICE FORMAT. If it is in the EVIL FORMAT, I see that group(2) is empty, so I read the next line of the log into a temp buffer and apply the second pattern pline2 to that new line.

Can someone help me with the regex? Is there a way I can make sure my current line (or even better, just the second grouping) is either the NICE FORMAT or is empty?

As requested, my Java code:

//1st line pattern
Pattern p = Pattern.compile("^[ \\s](\\.[tex]*\\.[\\._\\-\\@a-zA-Z0-9]*)\\s*([x0-9a-f]*)[ \\s]*([x0-9a-f]*).*");
//conditional 2nd line pattern
Pattern pline2 = Pattern.compile("^\\s*([x0-9a-f]*)[ \\s]*([x0-9a-f]*)\\s*[\\w\\(\\)\\.\\-]*");
while((temp = br1.readLine()) != null){
        Matcher m = p.matcher(temp);
        while(m.find()){
            System.out.println("What regex finds: m1:"+m.group(1)+"#    m2:"+m.group(2)+"#    m3:"+m.group(3));
            if(!m.group(1).isEmpty() && m.group(2).isEmpty() && m.group(3).isEmpty()){
                //means we probably hit a long symbol name and important stuff is on the next line
                //save the name at least
                name = m.group(1);
                //read and utilize the next line
                if((temp = br1.readLine()) == null){
                    return;
                }
                System.out.println("EVILline2:"+temp); //sanity check the input 
                System.out.println(pline2.toString()); //sanity check the regex
                Matcher m2= pline2.matcher(temp);
                while(m2.find()){
                       System.out.println("regex line2 finds: m1:"+m2.group(1));//+"#    m2:"+m2.group(2));
                       if(m2.group(2).isEmpty()){
                             size = 0;
                       }else{
                             size = Long.parseLong(m2.group(2).replaceFirst("0x", ""),16);
                       }

                       addr = Long.parseLong(m2.group(1).replaceFirst("0x", ""),16);
                       System.out.println("#########LONG NAME: "+name+"    addr:"+addr+"    size:"+size);
                  }
            }//end if
            else{ // assume in NICE FORMAT
                //do nice format stuff.
            }//end else
        }//end while
}//end outerwhile

As an aside, the output I currently get:

line: .text.c_print_results
What regex finds: m1:.text.c_print_results#    m2:#    m3:
EVIL FORMATline2:                0x00000000004001e6      0x231 c_print_results_x86.o
^\s*([x0-9a-f]*)[ \s]*([x0-9a-f]*)\s*[\w\(\)\.\-]*
Exception in thread "main" java.lang.IllegalStateException: No match found
at java.util.regex.Matcher.group(Matcher.java:536)
at java.util.regex.Matcher.group(Matcher.java:496)
at regexTest.regex.grabSymbolsInRange(regex.java:143)
at regexTest.regex.main(regex.java:489)
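For reference, "No match found" is what Matcher.group() throws when it is called before any successful find() or matches() on that matcher, which fits the fix described in UPDATE2. Below is a minimal sketch of that behaviour, assuming the pline2 pattern from above. The class and method names (GroupBeforeFindDemo, throwsWithoutFind, addrAndSize) are made up for illustration and are not part of my program.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GroupBeforeFindDemo {
    // Same second-line pattern as in the question.
    static final Pattern PLINE2 = Pattern.compile(
        "^\\s*([x0-9a-f]*)[ \\s]*([x0-9a-f]*)\\s*[\\w\\(\\)\\.\\-]*");

    /** Returns true if group(1) throws because no match was attempted yet. */
    static boolean throwsWithoutFind(String line) {
        Matcher m2 = PLINE2.matcher(line);
        try {
            m2.group(1); // no find() or matches() has run on m2
            return false;
        } catch (IllegalStateException e) {
            return true;
        }
    }

    /** Returns {address, size} after a successful find(), or null if nothing matched. */
    static String[] addrAndSize(String line) {
        Matcher m2 = PLINE2.matcher(line);
        if (!m2.find()) {
            return null;
        }
        return new String[] { m2.group(1), m2.group(2) };
    }

    public static void main(String[] args) {
        String line = "                0x00000000004001e6      0x231 c_print_results_x86.o";
        System.out.println("throws without find(): " + throwsWithoutFind(line));
        String[] parts = addrAndSize(line);
        System.out.println("addr=" + parts[0] + " size=" + parts[1]);
    }
}
```

Creating the Matcher does not run the regex; only find(), matches(), or lookingAt() does, so group() is valid only after one of them has returned true.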

Upvotes: 1

Views: 223

Answers (1)

Rodrigo López

Reputation: 4259

You have a few issues with your pattern.

  • 1st is the separation of first and second groups (that's why group 2 is returning null).
  • You have 4 groups and you need 3
  • After capturing your 3 values you can stop matching, so pattern after last group isn't necessary
  • you need the global modifier /g so it returns all matches (that is the regex101 flag; in Java, calling Matcher.find() in a loop does the same job)

So, instead of your posted Regex, you can try:

(\\.[tex]*\\.[\\._\\-\\@a-zA-Z0-9]*)\\s*([x0-9a-f]*)[ \\s]+([x0-9a-f]*)/g

Tested on Regex101.com:

https://regex101.com/r/lM4bQ9/1
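In Java the trailing /g is not part of the pattern string; it is regex101 notation, and repeated Matcher.find() calls play that role. Here is a rough sketch of how the pattern above could be applied to one line, assuming the /g suffix is simply dropped. The class and method names (SuggestedPatternDemo, extract) are hypothetical and only serve the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SuggestedPatternDemo {
    // The pattern from above, minus the regex101-only "/g" suffix.
    static final Pattern P = Pattern.compile(
        "(\\.[tex]*\\.[\\._\\-\\@a-zA-Z0-9]*)\\s*([x0-9a-f]*)[ \\s]+([x0-9a-f]*)");

    /** Collects {name, address, size} for every match in the line. */
    static List<String[]> extract(String line) {
        List<String[]> out = new ArrayList<>();
        Matcher m = P.matcher(line);
        while (m.find()) { // the loop stands in for the global flag
            out.add(new String[] { m.group(1), m.group(2), m.group(3) });
        }
        return out;
    }

    public static void main(String[] args) {
        String nice = " .text.rank     0x0000000000400b8f      0x351 is_x86.o";
        for (String[] g : extract(nice)) {
            System.out.println(g[0] + " | " + g[1] + " | " + g[2]);
        }
        // The corner-case line has no whitespace after the name, so nothing matches.
        System.out.println(extract(" *(.text.unlikely)").size());
    }
}
```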

Other than that, a few suggestions:

  • if you know your text is going to start with text, just put it in the pattern; don't use [tex]*, which requires some extra work from the engine.
  • [ \s] is the same thing as \s.
  • [\._\-\@a-zA-Z0-9]* is, from what I understood, basically everything but space, so why not just use [^\s]*?

So having these in mind I would suggest you to use this pattern instead:

(\\.text\\.[^\\s]*)\\s*([x0-9a-f]*)\\s+([x0-9a-f]*)/g
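For the wrapped (EVIL) case, one option is to glue a name-only line to the line that follows it and then run this single pattern on the result. This is a swapped-in approach, not the two-pattern code from the question. It is only a sketch under several assumptions: the /g suffix is dropped, addresses and sizes always start with 0x and fit in a long, and the names MapSymbolParser, Symbol, NAME_ONLY, parse and parseText are invented for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MapSymbolParser {
    // The pattern suggested above, minus the regex101-only "/g" suffix.
    static final Pattern ENTRY = Pattern.compile(
        "(\\.text\\.[^\\s]*)\\s*([x0-9a-f]*)\\s+([x0-9a-f]*)");
    // Hypothetical helper: a line that holds nothing but a section name.
    static final Pattern NAME_ONLY = Pattern.compile("^\\s*\\.text\\.\\S+\\s*$");

    /** One parsed entry; the type and its fields are illustrative only. */
    static final class Symbol {
        final String name;
        final long addr;
        final long size;
        Symbol(String name, long addr, long size) {
            this.name = name; this.addr = addr; this.size = size;
        }
    }

    static List<Symbol> parse(BufferedReader reader) throws IOException {
        List<Symbol> out = new ArrayList<>();
        String line;
        while ((line = reader.readLine()) != null) {
            if (NAME_ONLY.matcher(line).matches()) {
                // Long name: address and size sit on the next line, so join them.
                String next = reader.readLine();
                if (next == null) break;
                line = line.trim() + " " + next.trim();
            }
            Matcher m = ENTRY.matcher(line);
            // Skip matches without a 0x-prefixed address (e.g. " *(.text.unlikely)").
            if (m.find() && m.group(2).startsWith("0x")) {
                long size = m.group(3).startsWith("0x") ? Long.decode(m.group(3)) : 0L;
                out.add(new Symbol(m.group(1), Long.decode(m.group(2)), size));
            }
        }
        return out;
    }

    /** Convenience wrapper for in-memory text. */
    static List<Symbol> parseText(String text) {
        try {
            return parse(new BufferedReader(new StringReader(text)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String log = " .text.rank     0x0000000000400b8f      0x351 is_x86.o\n"
                   + " .text.__sfmoreglue \n"
                   + "                0x0000000000401d00       0x55 /mnt/drv2homelibc_popcorn.a(lib_a-findfp.o)\n"
                   + " *(.text.unlikely)\n";
        for (Symbol s : parseText(log)) {
            System.out.println(s.name + " addr=" + s.addr + " size=" + s.size);
        }
    }
}
```

Long.decode understands the 0x prefix, so no replaceFirst("0x", "") step is needed. Joining the lines first keeps a single pattern in play, at the cost of reading one line ahead whenever a bare name shows up.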

Upvotes: 1
