Reputation: 91
With Java regex, I am unable to match URLs that contain spaces or ( and ) brackets. Below is a code example; can you please help? Only the last URL, the one ending in E.jpeg, is matched correctly.
Code:
public static void main(String[] args) {
String content = "Lorem ipsum https://example.com/A B 123 4.pdf https://example.com/(C.jpeg https://example.com/D).jpeg https://example.com/E.jpeg";
extractUrls(content);
}
public static void extractUrls(String text) {
Pattern pat = Pattern.compile("(https?)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]", Pattern.CASE_INSENSITIVE);
Matcher matcher = pat.matcher(text);
while (matcher.find()) {
System.out.println(matcher.group());
}
}
Output:
https://example.com/A
https://example.com/
https://example.com/D
https://example.com/E.jpeg
Expected output:
https://example.com/A B 123 4.pdf
https://example.com/(C.jpeg
https://example.com/D).jpeg
https://example.com/E.jpeg
Upvotes: 0
Views: 220
Reputation: 91
The answer from user "The fourth bird" solved this problem. The regex should be:
http.*?\.(?:pdf|jpe?g)
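For anyone who wants to try it, here is a minimal sketch of that regex applied to the sample string from the question. The class name LazyUrlExtractor and the method extractUrls are just names picked for this example. It assumes every URL of interest ends in .pdf, .jpg or .jpeg, since the lazy .*? stops at the first such extension it finds.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LazyUrlExtractor {
    // Assumption: each URL ends in .pdf, .jpg or .jpeg.
    // The lazy ".*?" consumes as little as possible, so each match
    // stops at the first extension that follows "http".
    private static final Pattern URL = Pattern.compile("http.*?\\.(?:pdf|jpe?g)");

    static List<String> extractUrls(String text) {
        List<String> urls = new ArrayList<>();
        Matcher matcher = URL.matcher(text);
        while (matcher.find()) {
            urls.add(matcher.group());
        }
        return urls;
    }

    public static void main(String[] args) {
        String content = "Lorem ipsum https://example.com/A B 123 4.pdf "
                + "https://example.com/(C.jpeg https://example.com/D).jpeg "
                + "https://example.com/E.jpeg";
        // Print one extracted URL per line.
        for (String url : extractUrls(content)) {
            System.out.println(url);
        }
    }
}
```

One thing to keep in mind: because the match ends at the first listed extension, a URL such as https://example.com/a.pdf.bak would be cut off after .pdf, and other file types need to be added to the alternation by hand.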
Upvotes: 0
Reputation: 124
Take a look at this code:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MyClass {
    public static void main(String[] args) {
        // Note: the URLs are separated by TWO spaces here; the pattern below relies on that.
        String content = "Lorem ipsum  https://example.com/A B 123 4.pdf  https://example.com/(C.jpeg  https://example.com/D).jpeg  https://example.com/E.jpeg";
        extractUrls(content);
    }

    public static void extractUrls(String text) {
        // Scheme, then repeated runs of non-whitespace, each followed by at most one whitespace character.
        Pattern pat = Pattern.compile("(https?)://(([\\S]+)(\\s)?)*", Pattern.CASE_INSENSITIVE);
        Matcher matcher = pat.matcher(text);
        while (matcher.find()) {
            // trim() drops the single trailing space the pattern may consume.
            System.out.println(matcher.group().trim());
        }
    }
}
The output:
https://example.com/A B 123 4.pdf
https://example.com/(C.jpeg
https://example.com/D).jpeg
https://example.com/E.jpeg
Explanation:
I assume that a file name never contains two consecutive spaces, as in the examples, and that the URLs are separated from each other by two or more spaces.
The (https?):// part matches the prefix http:// or https://.
The next piece, (([\\S]+)(\\s)?), contains two inner groups. It matches one or more non-whitespace characters followed by at most one whitespace character.
The trailing * lets this piece repeat any number of times.
As a result, the match stops at a run of two or more spaces, which the expression treats as the separator between two file names.
I hope it helps.
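To illustrate the two-space assumption, here is a small sketch that runs the same pattern on two inputs: one where the URLs are separated by a single space and one where they are separated by two spaces. The class name SeparatorDemo and the helper extract are made up for this example. With single spaces the pattern has no way to tell where one URL ends, so it swallows everything up to the end of the string as one match.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SeparatorDemo {
    // Same pattern as in the answer: scheme, then repeated runs of
    // non-whitespace, each followed by at most one whitespace character.
    private static final Pattern URL =
            Pattern.compile("(https?)://(([\\S]+)(\\s)?)*", Pattern.CASE_INSENSITIVE);

    static List<String> extract(String text) {
        List<String> urls = new ArrayList<>();
        Matcher matcher = URL.matcher(text);
        while (matcher.find()) {
            // trim() removes the one trailing space the pattern may consume.
            urls.add(matcher.group().trim());
        }
        return urls;
    }

    public static void main(String[] args) {
        String singleSpaced = "https://example.com/A B 123 4.pdf https://example.com/E.jpeg";
        String doubleSpaced = "https://example.com/A B 123 4.pdf  https://example.com/E.jpeg";

        // Single spaces: the whole string comes back as one match.
        System.out.println(extract(singleSpaced).size());
        // Two spaces between URLs: each URL is matched separately.
        System.out.println(extract(doubleSpaced).size());
    }
}
```

So this approach only works if you control the separator. If the input really has single spaces between URLs, a pattern anchored on the file extension is the safer choice.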
Upvotes: 1