Bruno Andrade

Reputation: 595

Get URLs and ignore others

I'm trying to get all href URLs except those that contain "get/index.php" or "PICSNUM".

<a href="/video5505298733/travel_and_tourism_recovery_coronavirus." title="The places and companies missing tourist dollars most.">The places and companies missing tourist dollars most.</a></p><p class="info"><span class="bg"><span class="duration">10 min</span><a href="/get/index.php?id=qafMsaaScGLPuKqGuanBpZjHtGHKppeHpJu5r6G9raaHoqa3tJS-ope5tJK6s5TLqp8"><span class="name">CORONAVIRUS</span></a><span><span class="bolder"> - </span> 1.7k <span class="bolder">Views</span></span><span class="text-disabled"><span class="bolder"> - </span> 2 days ago</span><span class="bolder"> - </span></span></p></div></div>               <div class="thumb-lock "><div class="thumb-big"><div class="thumb"><a href="/midia54891337/PICSNUM/russia_fire_coronavirus_patients_intl"><img src="lightbox.gif" data-src="https://cdn-pic.cnews-cdn.com/videos/thumbs169/22/d3/a2/22d3a23423dfda7f5/22d3a2dfbb9fdfgd43f5.PICNUM.jpg"  /></a>

I looked at the topic "Regex to include one thing but exclude another" and at how negative lookahead works, but I don't think I understand it.

I tried this, but it didn't work:

(?<=href=")^(?!\/(get|PICSNUM))[a-z0-9-_\/.]+

https://regex101.com/r/bG8Rq4/2

I changed it to the pattern below. The result was better, but part of the URL containing PICSNUM is still returned:

(?<=href=")(?!\/(get|PICSNUM))[a-z0-9-_\/.]+

https://regex101.com/r/12HHHt/1

/video5505298733/travel_and_tourism_recovery_coronavirus.
/midia54891337/

Where am I going wrong? Regex is a little confusing to me

Upvotes: 0

Views: 54

Answers (1)

The fourth bird

Reputation: 163207

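You could use a DOM parser to get the value of each href. Once you have the values, you could use a negative lookahead to assert that the string does not start with /get and does not contain /PICSNUM.

The reason your pattern does not work yet is that the lookahead is tested only once, at the position right after href=", and /PICSNUM does not directly follow that position. The URL starting with /midia therefore passes the check, and the character class then matches up to the first uppercase letter, which gives /midia54891337/.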
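To make the partial match visible, here is a minimal sketch in Python (the question does not name a language, so Python's `re` module and the shortened sample markup are assumptions). It runs the second pattern from the question, with the hyphen moved to the end of the character class so it is unambiguous:

```python
import re

# Shortened sample based on the markup in the question (assumed for illustration)
html = (
    '<a href="/video5505298733/travel_and_tourism_recovery_coronavirus.">x</a>'
    '<a href="/get/index.php?id=abc">y</a>'
    '<a href="/midia54891337/PICSNUM/russia_fire_coronavirus_patients_intl">z</a>'
)

# Second attempt from the question; the lookahead only inspects the text
# that starts immediately after href="
original = re.compile(r'(?<=href=")(?!/(get|PICSNUM))[a-z0-9_/.-]+')

matches = [m.group(0) for m in original.finditer(html)]
print(matches)
# ['/video5505298733/travel_and_tourism_recovery_coronavirus.', '/midia54891337/']
```

The /get URL is rejected because /get sits right after the quote, while the /midia URL slips through and is cut off at the uppercase P.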

^(?!(?:/get|\S*/PICSNUM))\S+
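As a rough illustration of the parser-then-filter idea, here is a sketch in Python using the standard library's `html.parser` (the linked demo uses PHP's DOMDocument; the class name `HrefCollector`, the function `wanted_hrefs`, and the sample markup are hypothetical names chosen for this example):

```python
import re
from html.parser import HTMLParser

# Pattern from above: reject values that start with /get or contain /PICSNUM
KEEP = re.compile(r'^(?!(?:/get|\S*/PICSNUM))\S+')


class HrefCollector(HTMLParser):
    """Collects the href value of every <a> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)


def wanted_hrefs(html):
    """Return the href values that pass the negative lookahead."""
    collector = HrefCollector()
    collector.feed(html)
    collector.close()
    return [h for h in collector.hrefs if KEEP.match(h)]


# Shortened sample based on the markup in the question (assumed for illustration)
sample = (
    '<a href="/video5505298733/travel_and_tourism_recovery_coronavirus." title="t">x</a>'
    '<a href="/get/index.php?id=abc"><span class="name">CORONAVIRUS</span></a>'
    '<a href="/midia54891337/PICSNUM/russia_fire_coronavirus_patients_intl">z</a>'
)

print(wanted_hrefs(sample))
# ['/video5505298733/travel_and_tourism_recovery_coronavirus.']
```

Because the pattern is applied to one attribute value at a time, `\S*` cannot run past the end of the URL into the rest of the markup.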

Regex demo | PHP demo with DOMDocument

You could add the same alternation to the lookahead in your existing pattern, but keeping the lookbehind would not be very efficient.

Instead, you could match href=" literally and use a capturing group for the URL:

href="(?!(?:/get/index\.php|\S*/PICSNUM/))([a-z0-9-_/.]+)
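A short sketch of how the capturing-group pattern could be applied directly to the markup, again assuming Python's `re` module and a shortened sample (neither is from the question). The hyphen is moved to the end of the character class, which does not change what it matches:

```python
import re

# Capturing-group pattern from above; group 1 holds the URL
HREF = re.compile(r'href="(?!(?:/get/index\.php|\S*/PICSNUM/))([a-z0-9_/.-]+)')

# Shortened sample based on the markup in the question (assumed for illustration)
html = (
    '<a href="/video5505298733/travel_and_tourism_recovery_coronavirus." title="t">x</a>'
    '<a href="/get/index.php?id=abc"><span class="name">CORONAVIRUS</span></a>'
    '<a href="/midia54891337/PICSNUM/russia_fire_coronavirus_patients_intl">'
    '<img src="lightbox.gif" /></a>'
)

# With a single capturing group, findall returns the group 1 values
urls = HREF.findall(html)
print(urls)
# ['/video5505298733/travel_and_tourism_recovery_coronavirus.']
```

One caveat when matching raw markup: `\S*` also matches `"` and `>`, so the lookahead can look past the closing quote until the next whitespace. Replacing `\S*` with `[^"]*` keeps the check inside the attribute value.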

Regex demo

Upvotes: 1
