Extract between html tag with unknown tagname?

Question

Topic1asdasd

Topic2....

I want to extract everything between Topic1 and the next starting tag. In this case that would be:

asdasd

.

Problem: the tag is not necessarily this one; it could be any other repeating tag.

So my question is: how can I dynamically extract that text? The only static things are:

The signal keyword to look for is always "Topic1". I'd like to take the surrounding tags as the one to look for.

The tag is always repeated, but which tag it is can vary.

I know how to write the Java code, but what would the regex be like?

    String regex = ">Topic1<";
    Matcher m = Pattern.compile(regex).matcher(text);
    while (m.find()) {
        for (int i = 1; i <= m.groupCount(); i++) {
            System.out.println(m.group(i));
        }
    }

Martin Konecny · Accepted Answer

The following should work

<(\w+)>Topic1</\1>(.*?)<\1>

Input: Topic1

asdasd


Topic2

Output:

asdasd

Code:

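    // Requires java.util.regex.Matcher and java.util.regex.Pattern.
    // Group 1 captures the tag name; \\1 refers back to it, so the match
    // ends at the next opening tag with the same name.
    Pattern p = Pattern.compile("<(\\w+)>Topic1</\\1>(.*?)<\\1>");
    // <b> is only a stand-in here; any repeating tag name works.
    Matcher m = p.matcher("<b>Topic1</b>asdasd<b>Topic2</b>");
    while (m.find()) {
        System.out.println(m.group(2));  // asdasd
    }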
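For input that spans several lines, or a keyword other than "Topic1", the same backreference idea can be wrapped in a small helper. This is only a sketch under assumptions: the class name `TagExtractor`, the method `extractAfter`, and the sample tags `<b>` and `<h3>` are made up for illustration, and it assumes the tags carry no attributes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
final class TagExtractor {

    // Returns the text that follows <tag>keyword</tag> up to the next <tag>,
    // where the tag name is whatever tag wraps the keyword.
    static List<String> extractAfter(String html, String keyword) {
        // Group 1: tag name around the keyword. Group 2: the wanted text.
        // Pattern.quote keeps regex metacharacters in the keyword literal.
        String regex = "<(\\w+)>" + Pattern.quote(keyword) + "</\\1>(.*?)<\\1>";
        // DOTALL lets '.' cross line breaks, so multi-line content still matches.
        Matcher m = Pattern.compile(regex, Pattern.DOTALL).matcher(html);
        List<String> results = new ArrayList<>();
        while (m.find()) {
            results.add(m.group(2).trim());
        }
        return results;
    }

    public static void main(String[] args) {
        // <b> and <h3> are illustrative stand-ins for the unknown repeating tag.
        System.out.println(extractAfter("<b>Topic1</b>asdasd<b>Topic2</b>", "Topic1"));
        System.out.println(extractAfter("<h3>Topic1</h3>\nasdasd\n<h3>Topic2</h3>", "Topic1"));
    }
}
```

Both calls in `main` should print `[asdasd]`. If the real markup has attributes or nested tags, a regex like this will likely miss matches, and an HTML parser is probably the safer choice.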
