dagamo
dagamo

Reputation: 25

preg_match_all How to get all links?

I'm trying to get all images links with preg_match_all those that begin with http://i.ebayimg.com/ and ends with .jpg , from page that I'm scraping.. I Can not do it correctly... :( I tried this but this is not what i need...:

preg_match_all('/(http|https|ftp|ftps)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/\S*)?/', $contentas, $img_link);

Same problem is with normal links... I don't know how to write preg_match_all to this:

<a class="link--muted" href="http://suchen.mobile.de/fahrzeuge/details.html?id=218930381&daysAfterCreation=7&isSearchRequest=true&withImage=true&scopeId=C&categories=Limousine&damageUnrepaired=NO_DAMAGE_UNREPAIRED&zipcode=&fuels=DIESEL&ambitCountry=DE&maxPrice=11000&minFirstRegistrationDate=2006-01-01&makeModelVariant1.makeId=3500&makeModelVariant1.modelId=20&pageNumber=1" data-touch="hover" data-touch-wrapper=".cBox-body--resultitem">

Thank you very much!!!

UPDATE I'm trying from here: http://suchen.mobile.de/fahrzeuge/search.html?isSearchRequest=true&scopeId=C&makeModelVariant1.makeId=1900&makeModelVariant1.modelId=10&makeModelVariant1.modelDescription=&makeModelVariantExclusions%5B0%5D.makeId=&categories=Limousine&minSeats=&maxSeats=&doorCount=&minFirstRegistrationDate=2006-01-01&maxFirstRegistrationDate=&minMileage=&maxMileage=&minPrice=&maxPrice=11000&minPowerAsArray=&maxPowerAsArray=&maxPowerAsArray=PS&minPowerAsArray=PS&fuels=DIESEL&minCubicCapacity=&maxCubicCapacity=&ambitCountry=DE&zipcode=&q=&climatisation=&airbag=&daysAfterCreation=7&withImage=true&adLimitation=&export=&vatable=&maxConsumptionCombined=&emissionClass=&emissionsSticker=&damageUnrepaired=NO_DAMAGE_UNREPAIRED&numberOfPreviousOwners=&minHu=&usedCarSeals= get cars links and image links and all information, with information is everything fine, my script works good, but i have problem with scraping images and links.. here is my script :

<?php

        $id= $_GET['id'];
        $user= $_GET['user'];
        $login=$_COOKIE['login'];

    $query = mysql_query("SELECT pavadinimas,nuoroda,kuras,data,data_new from mobile where vartotojas='$user' and id='$id'");
    $rezultatas=mysql_fetch_row($query);

    $url = "$rezultatas[1]";

    $info = file_get_contents($url); 

function scrape_between($data, $start, $end){
$data = stristr($data, $start); 
$data = substr($data, strlen($start));
$stop = stripos($data, $end);
$data = substr($data, 0, $stop);
return $data;
  }
     //turinio iskirpimas
    $turinys = scrape_between($info, '<div class="g-col-9">', '<footer class="footer">');
     //filtravimas naikinami mokami top skelbimai
    $contentas = preg_replace('/<div class="cBox-body cBox-body--topResultitem".*?>(.*?)<\/div>/', '' ,$turinys);
    //filtravimas baigtas

      preg_match_all('/<span class="h3".*?>(.*?)<\/span>/',$contentas,$pavadinimas); 

      preg_match_all('/<span class="u-block u-pad-top-9 rbt-onlineSince".*?>(.*?)<\/span>/',$contentas,$data); 

      preg_match_all('/<span class="u-block u-pad-top-9".*?>(.*?)<\/span>/',$contentas,$miestas);

      preg_match_all('/<span class="h3 u-block".*?>(.*?)<\/span>/', $contentas, $kaina);

      preg_match_all('/<a[A-z0-9-_:="\.\/ ]+href="(http:\/\/suchen.mobile.de\/fahrzeuge\/[^"]*)"[A-z0-9-_:="\.\/ ]\s*>\s*<div/s', $contentas, $matches);

   print_r($pavadinimas);
   print_r($data);
   print_r($miestas);
   print_r($kaina);
   print_r($result);
   print_r($matches);

   ?>

Upvotes: 0

Views: 3849

Answers (2)

Louis Barranqueiro
Louis Barranqueiro

Reputation: 10228

1. To capture src attribute starting by http://i.ebayimg.com/ of all img tags :

regex : /src=\"((?:http|https):\\/\\/i.ebayimg.com\\/.+?.jpg)\"/i

Here is an example :

$re = "/src=\"((?:http|https):\\/\\/i.ebayimg.com\\/.+?.jpg)\"/i"; 
$str = "codeOfHTMLPage"; 
preg_match_all($re, $str, $matches);

Check it in live : here

If you want to be sure that you capture this url on an img tag then use this regex (keep in mind that performance will decrease if page is very long) :

$re = "/<img(?:.*?)src=\"((?:http|https):\\/\\/i.ebayimg.com\\/.+?.jpg)\"/i";

2. To capture href attribute starting by http://i.ebayimg.com/ of all a tags :

regex : /href=\"((?:http|https):\\/\\/suchen.mobile.de\\/fahrzeuge\\/.+?.jpg)\"/i

Here is an example :

$re = "/href=\"((?:http|https):\\/\\/suchen.mobile.de\\/fahrzeuge\\/.+?.jpg)\"/i; 
$str = "codeOfHTMLPage"; 
preg_match_all($re, $str, $matches);

Check it in live : here

If you want to be sure that you capture this url on an a tag then use this regex (keep in mind that performance will decrease if page is very long) :

$re = "/<a(?:.*?)href=\"((?:http|https):\\/\\/suchen.mobile.de\\/fahrzeuge\\/.+?.jpg)\"/i";

Upvotes: 2

Casimir et Hippolyte
Casimir et Hippolyte

Reputation: 89567

More handy with DOMDocument:

libxml_use_internal_errors(true);
$dom = new DOMDocument;
$dom->loadHTMLFile($yourURL);

$imgNodes = $dom->getElementsByTagName('img');

$result = [];

foreach ($imgNodes as $imgNode) {
    $src = $imgNode->getAttribute('src');
    $urlElts = parse_url($src);
    $ext = strtolower(array_pop(explode('.', $urlElts['path'])));
    if ($ext == 'jpg' && $urlElts['host'] == 'i.ebayimg.com')
        $result[] = $src;
}

print_r($result);

To get "normal" links, use the same way (DOMDocument + parse_url).

Upvotes: 1

Related Questions