Sandeep Singh

Reputation: 77

How to filter a long (dynamic) string with regex?

I have stored the response from a web application in a string. The string contains several URLs, and it is dynamic: it could hold anything from 10 to 1000 URLs.

I work in performance engineering, but this time I have to write a plugin in Java, and I am far from an expert in programming.

The problem is that my response string contains a lot of gibberish that I don't need, and I don't know how to filter it out. In my print/request I only want to send the URLs.

This is what I have so far:

    String responseData = "http://xxxx-f.akamaihd.net/i/world/open/20150426/1370235-005A/EPISOD-65354-005A-016f1729028090bf_,892,144,252,360,540,1584,2700,.mp4.csmil/segment1_4_av.ts?null=" +
            "#EXTINF:10.000, " +
            "http://xxxxx-f.akamaihd.net/i/world/open/20150426/1370235-005A/EPISOD-65365-005A-016f1729028090bf_,892,144,252,360,540,1584,2700,.mp4.csmil/segment2_4_av.ts?null=" +
            "#EXTINF:fgsgsmoregiberish, " +
            "http://xxxx-f.akamaihd.net/i/world/open/20150426/1370235-005A/EPISOD-6353-005A-016f1729028090bf_,892,144,252,360,540,1584,2700,.mp4.csmil/segment2_4_av.ts?null=";

    String pattern = "^(http://.*\\.ts)";

    // Requires java.util.regex.Pattern and java.util.regex.Matcher
    Pattern pr = Pattern.compile(pattern);
    Matcher math = pr.matcher(responseData);

    if (math.find()) {
        // This prints the whole response instead of the individual URLs.
        // I only want the URLs (dynamic; the names can differ, but they
        // all start with http and end with .ts).
        System.out.println(math.group());
    } else {
        System.out.println("No match");
    }

Upvotes: 0

Views: 74

Answers (3)

Kushagra Misra

Reputation: 1

Use the following regex pattern:

(((http|ftp|https):\/{2})+(([0-9a-z_-]+\.)+([a-z]{2,4})(:[0-9]+)?((\/([~0-9a-zA-Z\#\+\%@\.\/_-]+))?(\?[0-9a-zA-Z\+\%@\/&\[\];=_-]+)?)?))\b

Explanation:

  • Scheme: http, https or ftp followed by "//" : ((http|ftp|https):\/{2})
  • The + after that group is a quantifier: the scheme part must occur one or more times
  • Host labels, each followed by a literal dot : ([0-9a-z_-]+\.)+
  • Top-level domain of 2 to 4 lowercase letters : ([a-z]{2,4})
  • Optional port number (? means zero or one occurrence) : (:[0-9]+)?
  • Optional path, then optional query string : ((\/([~0-9a-zA-Z\#\+\%@\.\/_-]+))?(\?[0-9a-zA-Z\+\%@\/&\[\];=_-]+)?)?
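
As a rough sketch of how this pattern could be applied from Java: every backslash has to be doubled inside a string literal, and find() is called in a loop to collect all matches. The class name, the extract helper, and the sample strings are illustrative assumptions, not part of the answer. Note that the path character class contains no comma, so on URLs shaped like the ones in the question the match would be expected to stop at the first comma.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GenericUrlPatternDemo {
    // The answer's pattern, with each backslash doubled for a Java string literal.
    static final Pattern URL = Pattern.compile(
        "(((http|ftp|https):\\/{2})+(([0-9a-z_-]+\\.)+([a-z]{2,4})(:[0-9]+)?"
        + "((\\/([~0-9a-zA-Z\\#\\+\\%@\\.\\/_-]+))?(\\?[0-9a-zA-Z\\+\\%@\\/&\\[\\];=_-]+)?)?))\\b");

    // Hypothetical helper: returns every match of URL in the text, in order.
    static List<String> extract(String text) {
        List<String> urls = new ArrayList<>();
        Matcher m = URL.matcher(text);
        while (m.find()) {
            urls.add(m.group());
        }
        return urls;
    }

    public static void main(String[] args) {
        // A plain URL is matched in full, including its query string.
        System.out.println(extract("see http://example.com/a/b.ts?x=1 now"));
        // A URL with commas in the path is cut off before the first comma.
        System.out.println(extract(
            "http://cdn.example.com/i/a_,892,144,.mp4.csmil/segment1_4_av.ts?null="));
    }
}
```

For the comma-heavy URLs in the question, a pattern tailored to the ".ts" suffix is likely the better fit.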

Upvotes: 0

Pedro Lobito

Reputation: 98921

Just make your regex lazy by using .*? instead of the greedy .*, i.e.:

pr = Pattern.compile("(https?.*?\\.ts)");

Regex demo:

https://regex101.com/r/nQ5pA7/1


Regex Explanation:

(https?.*?\.ts)

Match the regex below and capture its match into backreference number 1 «(https?.*?\.ts)»
   Match the character string “http” literally (case sensitive) «http»
   Match the character “s” literally (case sensitive) «s?»
      Between zero and one times, as many times as possible, giving back as needed (greedy) «?»
   Match any single character that is NOT a line break character (line feed, carriage return, next line, line separator, paragraph separator) «.*?»
      Between zero and unlimited times, as few times as possible, expanding as needed (lazy) «*?»
   Match the character “.” literally «\.»
   Match the character string “ts” literally (case sensitive) «ts»
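
To make the greedy/lazy difference concrete, here is a minimal sketch that runs both variants over a shortened sample modeled on the question's data. The host cdn.example.com, the class name, and the findAll helper are assumptions for illustration. Because find() is called in a loop rather than once, every URL is collected instead of only the first.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LazyUrlDemo {
    // Hypothetical sample, shaped like the response in the question.
    static final String SAMPLE =
        "http://cdn.example.com/i/a_,892,144,.mp4.csmil/segment1_4_av.ts?null="
        + "#EXTINF:10.000, "
        + "http://cdn.example.com/i/b_,892,144,.mp4.csmil/segment2_4_av.ts?null="
        + "#EXTINF:junk, "
        + "http://cdn.example.com/i/c_,892,144,.mp4.csmil/segment3_4_av.ts?null=";

    // Collects every match of the regex in the input, in order.
    static List<String> findAll(String regex, String input) {
        List<String> matches = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) { // each call resumes after the previous match
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        // Greedy: .* runs to the last ".ts", so there is a single long match.
        System.out.println(findAll("https?.*\\.ts", SAMPLE).size());
        // Lazy: .*? stops at the first ".ts" after each "http", one match per URL.
        for (String url : findAll("https?.*?\\.ts", SAMPLE)) {
            System.out.println(url);
        }
    }
}
```

With this sample the lazy pattern yields three matches, each ending at ".ts", so the "?null=" suffix and the #EXTINF fragments are left out.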

Upvotes: 0

Casimir et Hippolyte

Reputation: 89557

Depending on what your URLs look like, you can use this naive pattern, which works for your examples and stops before the ? (written in Java string style):

\\bhttps?://[^?\\s]+

To ensure the match ends with .ts, you can change it to:

\\bhttps?://[^?\\s]+\\.ts

or

\\bhttps?://[^?\\s]+\\.ts(?=[\\s?]|\\z)

to check that the end of the path is reached.

Note that these patterns don't deal with URLs that contain spaces between double quotes.
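
A small sketch of the last pattern in use, assuming a shortened sample modeled on the question's data (the host names, class name, and extract helper are made up for illustration). It also shows what the lookahead adds: a path ending in ".tsx" is rejected rather than matched up to ".ts".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PathBoundedUrlDemo {
    // Third pattern from the answer: ".ts" must be followed by whitespace, "?" or end of input.
    static final Pattern TS_URL =
        Pattern.compile("\\bhttps?://[^?\\s]+\\.ts(?=[\\s?]|\\z)");

    // Hypothetical helper: returns every .ts URL found in the text, in order.
    static List<String> extract(String text) {
        List<String> urls = new ArrayList<>();
        Matcher m = TS_URL.matcher(text);
        while (m.find()) {
            urls.add(m.group());
        }
        return urls;
    }

    public static void main(String[] args) {
        String responseData =
            "http://cdn.example.com/i/a_,892,144,.mp4.csmil/segment1_4_av.ts?null="
            + "#EXTINF:10.000, "
            + "http://cdn.example.com/i/b_,892,144,.mp4.csmil/segment2_4_av.ts?null=";
        // Each URL is matched up to ".ts"; the query string is not included.
        for (String url : extract(responseData)) {
            System.out.println(url);
        }
        // ".ts" followed by "x" fails the lookahead, so nothing is matched here.
        System.out.println(extract("http://h.example/x.tsx").isEmpty());
    }
}
```

Since [^?\\s]+ cannot cross whitespace or a "?", each match stays inside a single URL without relying on lazy quantifiers.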

Upvotes: 2
