Drifter64

Reputation: 1123

How to find and get multiple substrings within a string in Java, and whether regex is needed

Let's say we have a String in Java, which contains HTML code.

I would like to return every substring of this string that looks like "<li>stuff here</li>". I also realize that the opening li tag may have attributes. The BIGGEST problem is that there might be multiple <li></li> pairs on one line, especially if whoever wrote the HTML likes to have everything compressed and less human-readable! ;)

I have thought for a while about using something like String.split and going through the array of strings programmatically, setting a boolean flag to true when I'm inside a <li> tag and to false when I exit. Maybe this will work, but it feels very inelegant.

How can I design a method that returns, say, an ArrayList<String> of all results? Can I do this without regex? I have looked up regex and it seems powerful, but sometimes the syntax can be very complicated. If I have to resort to regex I will, but simpler, clearer solutions are appreciated!

If there is no elegant and clear way without regex, I will deal with the regex patterns.

Upvotes: 0

Views: 1586

Answers (1)

GameDroids

Reputation: 5662

I think the regex does not need to be complex at all. What you want to do (if I understand it correctly) is actually get rid of everything that looks like <li>, or even any other HTML tag, and just keep the rest.

  // needs: import java.util.Arrays;
  String test = "<li>stuff here</li>";
  String[] split = test.split("(<.*?>)");
  System.out.println(Arrays.toString(split));

If you run that code, it prints this:

[, stuff here]

The regex:
( regex here ) -> the parentheses form a group: you are looking for something that matches the regex inside them. As always, you can use parentheses to combine several smaller expressions into one big regex (for split they are optional here). Anyway:

<.*?>

< means: "I want something that starts with a <"
. means: "after my < there can be anything: a letter, a digit, a special character... any character except a line break"
* means: "the previous element (here: any character) may repeat as often as it likes, including zero times"
? means: "keep the match as short as possible". Placed right after the *, it makes the * lazy, so the match stops at the first > it reaches instead of running on to the last > on the line. (see comment of CAustin – thanks!)
> means: "the match ends at this >; whatever I found between my first < and this >, I don't care, I just found my regex"

So this matches any tag, for example:

<li>
</li>
<title>
<div id="todeloot">
</tr>

Everything with a < at the beginning and a > at the end will match your regex.
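
To see what the ? after the * actually changes, here is a minimal sketch that lists every match of a pattern in a string. The class name LazyVsGreedy and the method findMatches are made up for this illustration; only Pattern and Matcher come from java.util.regex.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LazyVsGreedy {

    // Hypothetical helper: collects every substring of input that matches regex.
    public static List<String> findMatches(String regex, String input) {
        List<String> matches = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            matches.add(m.group()); // group() is the text of the current match
        }
        return matches;
    }

    public static void main(String[] args) {
        String html = "<li>a</li>";

        // Lazy: each match stops at the first '>' it reaches.
        System.out.println(findMatches("<.*?>", html)); // prints [<li>, </li>]

        // Greedy: a single match runs on to the last '>' in the line.
        System.out.println(findMatches("<.*>", html));  // prints [<li>a</li>]
    }
}
```

With the greedy version, split would treat the whole line as one big separator, which is why the lazy form is the one used here.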

Now the split method will cut your HTML string into many little strings and put them into an array, but it leaves out the parts that the regex matched. That means the <title> or the <li> will just get swallowed.

Example:

<html><body><H1>hello world</h1><li>list item 1</li><li>list item 2</li> well that was my list.</body></html>

would result in:

[, , , hello world, , list item 1, , list item 2,  well that was my list.]

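The empty strings at the beginning and in the middle appear wherever there is no text between two HTML tags. (Empty strings at the very end are dropped by split, which is why none show up after the last piece of text.)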
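
If the empty strings are not wanted, one possible way is to filter them out while copying the pieces into a list. This is only a sketch: the class name TextBetweenTags and the method textParts are invented for the example, and trimming the pieces is my own choice, not something split does.

```java
import java.util.ArrayList;
import java.util.List;

public class TextBetweenTags {

    // Hypothetical helper: splits at every tag and keeps only non-blank pieces.
    public static List<String> textParts(String html) {
        List<String> parts = new ArrayList<>();
        for (String piece : html.split("<.*?>")) {
            String trimmed = piece.trim();
            if (!trimmed.isEmpty()) {   // skip the gaps between adjacent tags
                parts.add(trimmed);
            }
        }
        return parts;
    }

    public static void main(String[] args) {
        String html = "<html><body><H1>hello world</h1><li>list item 1</li>"
                + "<li>list item 2</li> well that was my list.</body></html>";
        System.out.println(textParts(html));
        // prints [hello world, list item 1, list item 2, well that was my list.]
    }
}
```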

Another Example

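"(<li.*?>)" – will cut the string right at every opening <li> tag (with or without additional HTML attributes like id or name or whatever). Note that it also matches other tags whose names start with li, such as <link>.
"(<.?li.*?>)" – will match everything that looks like <li> or </li> (also with or without additional attributes). Because the . stands for any character, "(</?li.*?>)" states the intent more precisely.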
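
If only the <li> elements are wanted, as in the question, Pattern and Matcher can collect them directly into a list instead of splitting. The following is a rough sketch under a few assumptions: the names ListItemExtractor and extractListItems are hypothetical, the pattern is my own variation on the ones above, and nested lists are not handled. For messy real-world HTML a dedicated HTML parser is likely the more robust choice.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ListItemExtractor {

    // <li, a word boundary (so <link> is not matched), optional attributes, '>',
    // then lazily captured content up to the next </li>.
    // DOTALL lets the content span line breaks; CASE_INSENSITIVE accepts <LI>.
    private static final Pattern LI = Pattern.compile(
            "<li\\b[^>]*>(.*?)</li>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Hypothetical helper: returns the text inside every <li>...</li> pair.
    public static List<String> extractListItems(String html) {
        List<String> items = new ArrayList<>();
        Matcher m = LI.matcher(html);
        while (m.find()) {
            items.add(m.group(1)); // use m.group() to keep the tags as well
        }
        return items;
    }

    public static void main(String[] args) {
        String html = "<html><body><H1>hello world</h1><li>list item 1</li>"
                + "<li>list item 2</li> well that was my list.</body></html>";
        System.out.println(extractListItems(html));
        // prints [list item 1, list item 2]
    }
}
```

Multiple <li></li> pairs on one line are fine here, because the lazy (.*?) stops at the first closing </li> after each opening tag.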

Upvotes: 1
