membersound

Reputation: 86915

Extract content between HTML tags with an unknown tag name?

<b>Topic1</b><ul>asdasd</ul><br/><b>Topic2</b><ul>....

I want to extract everything between <b>Topic1</b> and the next opening <b> tag, which in this case would be: <ul>asdasd</ul><br/>.

Problem: it is not necessarily the <b> tag; it could be any other repeating tag.

So my question is: how can I extract that text dynamically? The only static things are:

I know how to write the Java code, but what would the regex look like?

String regex = ">Topic1<";
Matcher m = Pattern.compile(regex).matcher(text);
while (m.find()) {
    for (int i = 1; i <= m.groupCount(); i++) {
        System.out.println(m.group(i));
    }
}

Upvotes: 1

Views: 201

Answers (2)

mcjcloud

Reputation: 371

Try this

String pattern = "\\<.*?\\>Topic1\\<.*?\\>"; // this will see the tag no matter what tag it is
String text = "<b>Topic1</b><ul>asdasd</ul><br/><b>Topic2</b>"; // your string to be split
String[] attributes = text.split(pattern);
for(String atr : attributes) 
{
    System.out.println(atr);
}

Will print out (preceded by an empty line, because the match sits at the very start of the string):

<ul>asdasd</ul><br/><b>Topic2</b>
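To make the array returned by split easier to inspect, here is a minimal self-contained sketch. The class name SplitDemo and the helper splitOnTopic are made up for illustration, and the pattern is a slightly stricter variant of the one above (`[^>]*` instead of `.*?`, with the topic passed through Pattern.quote). It also shows that the second element still contains everything up to the end of the input, including <b>Topic2</b>.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class SplitDemo {
    // Hypothetical helper: splits the text on "<anyTag>topic</anyTag>".
    static String[] splitOnTopic(String text, String topic) {
        // Any opening tag, the literal topic text, then any closing tag.
        String pattern = "<[^>]*>" + Pattern.quote(topic) + "</[^>]*>";
        return text.split(pattern);
    }

    public static void main(String[] args) {
        String text = "<b>Topic1</b><ul>asdasd</ul><br/><b>Topic2</b>";
        // The first element is an empty string because the match starts at index 0.
        System.out.println(Arrays.toString(splitOnTopic(text, "Topic1")));
        // prints: [, <ul>asdasd</ul><br/><b>Topic2</b>]
    }
}
```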

Upvotes: 0

Martin Konecny

Reputation: 59681

The following should work (written as a Java string literal, hence the doubled backslash in \\1):

Topic1</(.+?)>(.*?)<\\1>

Input: <b>Topic1</b><ul>asdasd</ul><br/><b>Topic2</b><ul>

Output: <ul>asdasd</ul><br/>

Code:

    Pattern p = Pattern.compile("Topic1</(.+?)>(.*?)<\\1>");
    //  get a matcher object
    Matcher m = p.matcher("<b>Topic1</b><ul>asdasd</ul><br/><b>Topic2</b><ul>");
    while(m.find()) {
        System.out.println(m.group(2));  // <ul>asdasd</ul><br/>
    }
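
As a possible refinement, the sketch below wraps the same idea in a reusable method. The names TopicExtractor and extractAfter are hypothetical. It assumes the topic text sits directly inside a simple element, quotes the topic so it is matched literally, also captures the opening tag, and stops at the next opening tag of the same name or at the end of the input. Regex-based HTML handling like this only suits simple, predictable markup.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TopicExtractor {
    // Hypothetical helper: returns the markup between the element containing
    // `topic` and the next opening tag with the same name (or end of input).
    static Optional<String> extractAfter(String html, String topic) {
        Pattern p = Pattern.compile(
                "<(\\w+)[^>]*>"             // opening tag, name captured as group 1
                + Pattern.quote(topic)      // topic text, matched literally
                + "</\\1>"                  // closing tag with the same name
                + "(.*?)"                   // group 2: the wanted content, lazy
                + "(?=<\\1[\\s>]|\\z)",     // stop before the next <name...> or at end of input
                Pattern.DOTALL);            // let '.' span line breaks
        Matcher m = p.matcher(html);
        return m.find() ? Optional.of(m.group(2)) : Optional.empty();
    }

    public static void main(String[] args) {
        String text = "<b>Topic1</b><ul>asdasd</ul><br/><b>Topic2</b><ul>";
        System.out.println(extractAfter(text, "Topic1").orElse("(no match)"));
        // prints: <ul>asdasd</ul><br/>
    }
}
```

The lookahead requires whitespace or > right after the tag name, so <br/> is not mistaken for the next <b> tag.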

Upvotes: 2
