Reputation: 595
I'm trying to get all href URLs except those that contain "get/index.php" or "PICSNUM".
<a href="/video5505298733/travel_and_tourism_recovery_coronavirus." title="The places and companies missing tourist dollars most.">The places and companies missing tourist dollars most.</a></p><p class="info"><span class="bg"><span class="duration">10 min</span><a href="/get/index.php?id=qafMsaaScGLPuKqGuanBpZjHtGHKppeHpJu5r6G9raaHoqa3tJS-ope5tJK6s5TLqp8"><span class="name">CORONAVIRUS</span></a><span><span class="bolder"> - </span> 1.7k <span class="bolder">Views</span></span><span class="text-disabled"><span class="bolder"> - </span> 2 days ago</span><span class="bolder"> - </span></span></p></div></div> <div class="thumb-lock "><div class="thumb-big"><div class="thumb"><a href="/midia54891337/PICSNUM/russia_fire_coronavirus_patients_intl"><img src="lightbox.gif" data-src="https://cdn-pic.cnews-cdn.com/videos/thumbs169/22/d3/a2/22d3a23423dfda7f5/22d3a2dfbb9fdfgd43f5.PICNUM.jpg" /></a>
I looked at this topic and at how negative lookahead works, but I don't think I understand it: Regex to include one thing but exclude another
I tried this, but it didn't work:
(?<=href=")^(?!\/(get|PICSNUM))[a-z0-9-_\/.]+
https://regex101.com/r/bG8Rq4/2
I changed it to the following. The result was better, but part of the URL containing PICSNUM is still being returned:
(?<=href=")(?!\/(get|PICSNUM))[a-z0-9-_\/.]+
https://regex101.com/r/12HHHt/1
/video5505298733/travel_and_tourism_recovery_coronavirus.
/midia54891337/
Where am I going wrong? Regex is a little confusing to me
Upvotes: 0
Views: 54
Reputation: 163207
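You could use a DOM parser to get the value of each href. Once you have the values, you could use a negative lookahead to assert that the string does not start with /get and does not contain /PICSNUM.
The reason the pattern does not work yet is that the lookahead is only tested at the position right after the lookbehind, and /PICSNUM does not directly follow href=" there. The match therefore starts anyway and stops at the first character outside [a-z0-9-_\/.], which is the uppercase P, leaving /midia54891337/.
A pattern for the extracted values: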
^(?!(?:/get|\S*/PICSNUM))\S+
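As a rough illustration of the "parse first, then filter" approach, here is a minimal Python sketch. It assumes the standard library's html.parser in place of PHP's DOMDocument, and the names HrefCollector, wanted_hrefs and sample are made up for this example. The regex itself is the one shown above.

```python
import re
from html.parser import HTMLParser

# Pattern from the answer, applied to an already-extracted href value.
HREF_FILTER = re.compile(r'^(?!(?:/get|\S*/PICSNUM))\S+')


class HrefCollector(HTMLParser):
    """Collects the href attribute of every <a> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)


def wanted_hrefs(html):
    """Return the hrefs that neither start with /get nor contain /PICSNUM."""
    collector = HrefCollector()
    collector.feed(html)
    collector.close()
    return [h for h in collector.hrefs if HREF_FILTER.match(h)]


# Shortened sample modelled on the markup in the question.
sample = (
    '<a href="/video5505298733/travel_and_tourism_recovery_coronavirus." title="t">x</a>'
    '<a href="/get/index.php?id=abc"><span class="name">y</span></a>'
    '<a href="/midia54891337/PICSNUM/russia_fire_coronavirus_patients_intl">z</a>'
)

print(wanted_hrefs(sample))
```

Because the lookahead is anchored with ^ and scans the whole value with \S*, it rejects /PICSNUM anywhere in the href, not only at the start.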
Regex demo | PHP demo with DOMDocument
You could also add this alternation to the lookahead of your existing pattern, but that would not be very efficient.
Instead, you could match href=" literally and capture the value in a group:
href="(?!(?:/get/index\.php|\S*/PICSNUM/))([a-z0-9-_/.]+)
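A minimal Python sketch of how the capturing-group version could be run against the raw HTML. The names LINK_PATTERN, extract_links and sample are assumptions for this example. The hyphen in the character class has been moved to the end, which matches the same characters but avoids any ambiguity about ranges.

```python
import re

# Capturing-group pattern from the answer; the hyphen is moved to the end
# of the character class so it is unambiguously a literal.
LINK_PATTERN = re.compile(
    r'href="(?!(?:/get/index\.php|\S*/PICSNUM/))([a-z0-9_/.-]+)'
)


def extract_links(html):
    """Return group 1 (the href value) for every accepted link."""
    return LINK_PATTERN.findall(html)


# Shortened sample modelled on the markup in the question.
sample = (
    '<a href="/video5505298733/travel_and_tourism_recovery_coronavirus." title="t">x</a>'
    '<a href="/get/index.php?id=abc"><span class="name">y</span></a>'
    '<a href="/midia54891337/PICSNUM/russia_fire_coronavirus_patients_intl">z</a>'
)

print(extract_links(sample))
```

One caveat when matching raw HTML: \S* in the lookahead can run past the closing quote until the next whitespace, so a /PICSNUM/ link later in the same whitespace-free run would also suppress an earlier href. Replacing \S* with [^"]* is one possible way to confine the check to the attribute value, though that is a suggestion rather than part of the original answer.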
Upvotes: 1