gb051

Reputation: 13

Extracting data with regex

I have a nearly working solution, but the regex splits the string into an empty string "" plus the two parts I actually need.

String  Result = "<ahref=https://blabla.com/Securities_regulation_in_the_United_States>Securities regulation in the United States</a> - Securities regulation in the United States is the field of U.S. law that covers transactions and other dealings with securities.";

String[] Arr = Result.split("<[^>]*>");
for (String elem : Arr) {
    // println avoids treating elem as a format string and prints one element per line
    System.out.println(elem);
}

The result is:

Arr[0]= ""
Arr[1]= Securities regulation in the United States
Arr[2]= Securities regulation in the United States is the field of U.S. law that covers transactions and other dealings with securities.

The Arr[1] and Arr[2] elements are fine; I just can't get rid of the empty Arr[0].

Upvotes: 1

Views: 60

Answers (2)

Federico Piazza

Reputation: 30985

Instead of splitting on the tags, you can invert the approach and capture the text between them with a regex like this:

(?s)(?:^|>)(.*?)(?:<|$)


Code:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

String line = "<ahref=https://blabla.com/Securities_regulation_in_the_United_States>Securities regulation in the United States</a> - Securities regulation in the United States is the field of U.S. law that covers transactions and other dealings with securities.";

Pattern pattern = Pattern.compile("(?s)(?:^|>)(.*?)(?:<|$)");
Matcher matcher = pattern.matcher(line);
while (matcher.find()) {
    // The input starts with a tag, so the first capture is empty; skip empty captures
    if (!matcher.group(1).isEmpty()) {
        System.out.println("group 1: " + matcher.group(1));
    }
}
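
If the pieces are needed as a collection rather than printed, the same pattern can fill a list. This is only a minimal sketch: the class name TagTextExtractor, the helper extract, and the shortened sample string are made up for illustration, and empty captures (such as the one produced when the input starts with a tag) are skipped.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagTextExtractor {
    // Same pattern as above: capture the text between '>' (or start) and '<' (or end).
    private static final Pattern TEXT = Pattern.compile("(?s)(?:^|>)(.*?)(?:<|$)");

    // Hypothetical helper: returns every non-empty captured segment, in order.
    static List<String> extract(String input) {
        List<String> parts = new ArrayList<>();
        Matcher matcher = TEXT.matcher(input);
        while (matcher.find()) {
            String text = matcher.group(1);
            if (!text.isEmpty()) {
                parts.add(text);
            }
        }
        return parts;
    }

    public static void main(String[] args) {
        // Shortened stand-in for the string in the question (an assumption for brevity).
        String line = "<ahref=https://blabla.com/x>Link text</a> - Description.";
        for (String part : extract(line)) {
            System.out.println(part);
        }
    }
}
```

Note that the second segment keeps its leading " - "; call trim() or strip the separator separately if that is not wanted.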

Upvotes: 2

Pshemo

Reputation: 124215

You can't avoid that empty string if you use only split: your regex matches a non-empty tag at the very start of the input, and split keeps the empty string that precedes such a match.

You could remove the match at the start of your input first, and then split on the remaining matches, like this:

String[] Arr = Result.replaceFirst("^<[^>]+>", "").split("<[^>]+>");

But generally you should avoid using regex with HTML/XML. Try using a parser such as Jsoup instead.
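
To make the difference concrete, here is a small sketch comparing a plain split with the replaceFirst-then-split approach. The class name SplitWithoutLeadingEmpty, the helper splitOnTags, and the shortened sample string are assumptions for illustration only.

```java
import java.util.Arrays;

class SplitWithoutLeadingEmpty {
    // Hypothetical helper: strip a tag at the very start, then split on the remaining tags.
    static String[] splitOnTags(String input) {
        return input.replaceFirst("^<[^>]+>", "").split("<[^>]+>");
    }

    public static void main(String[] args) {
        // Shortened stand-in for the string in the question (an assumption for brevity).
        String result = "<ahref=https://blabla.com/x>Link text</a> - Description.";

        // Plain split: the tag at index 0 yields an empty first element.
        System.out.println(Arrays.toString(result.split("<[^>]+>")));

        // Leading tag removed first: only the two text parts remain.
        System.out.println(Arrays.toString(splitOnTags(result)));
    }
}
```

If the input does not start with a tag, replaceFirst simply leaves it unchanged, so the helper behaves like a plain split in that case.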

Upvotes: 1
