Reputation: 33
I have to build an app that extracts data from a website, but the site's markup is unstructured and I do not know where to begin. Can you suggest how to extract data such as the name and address from the page? The data is in a table, and the elements have no ids or other hooks to select them by.
I started with this code:
function get_url_contents($url){
    $crl = curl_init();
    $timeout = 5;
    curl_setopt($crl, CURLOPT_URL, $url);
    curl_setopt($crl, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($crl, CURLOPT_CONNECTTIMEOUT, $timeout);
    $ret = curl_exec($crl);
    curl_close($crl);
    return $ret;
}
$result = get_url_contents("***********");
// Escape the angle brackets so the browser shows the markup instead of rendering it
$result = str_replace("<", "&lt;", $result);
$result = str_replace(">", "&gt;", $result);
echo nl2br($result);
So I am getting the page's HTML source, but I don't know how to continue.
The markup looks like this:
<td>
<h4 class="normal"><strong>Základní identifikační údaje</strong></h4>
</td>
</tr>
<tr>
<td>
<div class="dkLeftLine"></div>
</td>
<td>
Name:
</td>
<td>
<b>Mo******</b>
</td>
</tr>
<tr>
<td>
<div class="dkLeftLine"></div>
</td>
<td>
VAT:
</td>
<td>
<a href="****">
(******)
</a>
</td>
</tr>
<tr>
<td>
<div class="dkLeftLine"></div>
</td>
<td>
Rodné číslo / Datum nar.:
</td>
<td>
*****/**** / **.**.****
</td>
</tr>
<tr>
<td >
<div class="dkLeftLine"></div>
</td>
<td >
Bydliště:
</td>
<td>
****, ** ****** ***, *** *** **
</td>
</tr>
Upvotes: 0
Views: 41
Reputation: 824
Web scraping often deals with insufficiently structured data. Even well-structured sources, e.g. those using microformats, are not necessarily reliable: a user may, for example, have entered their given name in the family-name field.
Your sample seems structured enough to get at least some data:
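// $markup holds the HTML string, e.g. the return value of get_url_contents()
$dom = new DOMDocument('1.0', 'utf-8');
$dom->loadHTML($markup);
$xpath = new DOMXPath($dom);
// Select every table cell that sits directly inside a row
$elements = $xpath->query('//tr/td');
foreach ($elements as $element) {
    // nodeValue is the cell's text content with all tags stripped
    print trim($element->nodeValue) . PHP_EOL;
}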
In each row, the first cell printed carries no meaning, the second appears to be the label, and the third is the corresponding value, which you can then process.
Note that this is just a sample; you will have to refine the XPath queries.
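One possible refinement, as a rough sketch only: query whole rows instead of single cells and pair the second cell (label) with the third (value). The function name extract_fields and the sample markup are made up for illustration, and the code assumes every data row has exactly three cells, as in the excerpt from the question. The XML encoding prefix is a commonly used workaround to make loadHTML read the input as UTF-8; whether you need it depends on the page.

```php
<?php
// Hypothetical sample markup modelled on the rows shown in the question.
$markup = <<<'HTML'
<table>
<tr><td><h4 class="normal"><strong>Heading</strong></h4></td></tr>
<tr><td><div class="dkLeftLine"></div></td><td>
Name:
</td><td><b>Example Name</b></td></tr>
<tr><td><div class="dkLeftLine"></div></td><td>
VAT:
</td><td><a href="#">
(12345)
</a></td></tr>
</table>
HTML;

// Returns an array mapping each label (second cell) to its value (third cell).
function extract_fields(string $markup): array
{
    $dom = new DOMDocument();
    // Scraped HTML is rarely valid; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    // Assumption: the input is UTF-8. The prefix hints that encoding to the parser.
    $dom->loadHTML('<?xml encoding="utf-8" ?>' . $markup);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $fields = [];
    // Only rows with exactly three cells: spacer, label, value.
    foreach ($xpath->query('//tr[count(td) = 3]') as $row) {
        $label = rtrim(trim($xpath->evaluate('string(td[2])', $row)), ':');
        $value = trim($xpath->evaluate('string(td[3])', $row));
        if ($label !== '') {
            $fields[$label] = $value;
        }
    }
    return $fields;
}

print_r(extract_fields($markup));
```

The heading row is skipped because it has only one cell. On the real page you would pass the string returned by get_url_contents() and will probably need to adjust the row filter, for example if some rows have more or fewer cells.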
Upvotes: 1