Reputation: 11
I'm trying to scrape a web page that is hidden behind a form, using PHP and the simple_html_dom.php library. Unfortunately, the form's action URL appears to be generated dynamically, because I can only scrape the initial part of it.
I used the following code:
<?php
require 'simple_html_dom.php';

$formPageUrl = "https://example.com/form-page";

// Download and parse the page that contains the form
$html = file_get_html($formPageUrl);
$form = $html->find('form', 0);
if (!$form) {
    die("Form not found.");
}

// Read the form action and make it absolute if it has no scheme
$actionUrl = $form->action;
if (!parse_url($actionUrl, PHP_URL_SCHEME)) {
    $actionUrl = rtrim($formPageUrl, '/') . '/' . ltrim($actionUrl, '/');
}

// Collect every named input, overriding the credentials
$formData = [];
foreach ($form->find('input') as $input) {
    $name = $input->name;
    $value = $input->value ?? '';
    if ($name === 'username') {
        $value = 'my_username';
    } elseif ($name === 'password') {
        $value = 'my_password';
    }
    if ($name) {
        $formData[$name] = $value;
    }
}

// Send the form data using cURL
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $actionUrl);
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query($formData));
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$response = curl_exec($ch);
if (curl_errno($ch)) {
    die("cURL error: " . curl_error($ch));
}
curl_close($ch);

echo $response;
I got no response at all. When I echoed the action URL, I found that it doesn't match the one on the page. For example:
I get /it/it/page/
But the actual action URL contains a random string: /it/it/page/Aihrkjrnjfvijkregv1
Looking at the Network tab in the browser's developer tools, I can see that this string is indeed sent as part of the request that loads the page, and it changes every time a new session starts, which prevents me from replicating the request. I'm fairly new to web scraping, so any useful advice is appreciated.
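One thing worth checking can be sketched as follows, under the assumption (not confirmed by anything observed so far) that the random string is tied to the session cookie the server sets when the form page is loaded. The idea is to fetch the form page and submit the form through the same cURL handle and cookie jar, so that the action read from the HTML and the POST belong to one session. The helper names resolveActionUrl, newSessionHandle and fetchAndSubmit are hypothetical. If the action is filled in by JavaScript in the browser, this will not recover it. The sketch also resolves a root-relative action such as /it/it/page/ against the site origin instead of appending it to the form page URL.

```php
<?php
// Hypothetical helper: resolve a form action against the page URL.
// Simplified: does not handle "../" segments or query-only actions.
function resolveActionUrl(string $pageUrl, string $action): string
{
    // Already absolute (has a scheme): use as-is.
    if (parse_url($action, PHP_URL_SCHEME)) {
        return $action;
    }
    // Empty action means "submit to the current page".
    if ($action === '') {
        return $pageUrl;
    }
    $parts = parse_url($pageUrl);
    // Protocol-relative action: reuse only the page's scheme.
    if (strncmp($action, '//', 2) === 0) {
        return $parts['scheme'] . ':' . $action;
    }
    $origin = $parts['scheme'] . '://' . $parts['host']
        . (isset($parts['port']) ? ':' . $parts['port'] : '');
    // Root-relative action: replace the whole path.
    if ($action[0] === '/') {
        return $origin . $action;
    }
    // Path-relative action: resolve against the page's directory.
    $path = $parts['path'] ?? '/';
    $dir = substr($path, 0, strrpos($path, '/') + 1);
    return $origin . $dir . $action;
}

// Hypothetical helper: one cURL handle whose cookies persist across requests.
function newSessionHandle(string $cookieFile)
{
    $ch = curl_init();
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_COOKIEJAR      => $cookieFile, // cookies are written here on close
        CURLOPT_COOKIEFILE     => $cookieFile, // enables cURL's cookie engine
    ]);
    return $ch;
}

// Hypothetical helper: GET the form page, then POST the form in the same session.
// $overrides maps input names to the values to send instead of the defaults.
function fetchAndSubmit(string $formPageUrl, array $overrides): string
{
    require_once 'simple_html_dom.php'; // same library as in the code above
    $ch = newSessionHandle(tempnam(sys_get_temp_dir(), 'cookies'));

    // 1. GET the form page through the handle so any session cookie is stored.
    curl_setopt($ch, CURLOPT_URL, $formPageUrl);
    $page = curl_exec($ch);
    if ($page === false) {
        throw new RuntimeException('cURL error: ' . curl_error($ch));
    }

    // 2. Parse the HTML that this session actually received.
    $html = str_get_html($page);
    $form = $html ? $html->find('form', 0) : null;
    if (!$form) {
        throw new RuntimeException('Form not found.');
    }
    $formData = [];
    foreach ($form->find('input') as $input) {
        if (is_string($input->name) && $input->name !== '') {
            $default = is_string($input->value) ? $input->value : '';
            $formData[$input->name] = $overrides[$input->name] ?? $default;
        }
    }

    // 3. POST to the resolved action with the same handle (same cookies).
    $action = is_string($form->action) ? $form->action : '';
    curl_setopt_array($ch, [
        CURLOPT_URL        => resolveActionUrl($formPageUrl, $action),
        CURLOPT_POST       => true,
        CURLOPT_POSTFIELDS => http_build_query($formData),
    ]);
    $response = curl_exec($ch);
    if ($response === false) {
        throw new RuntimeException('cURL error: ' . curl_error($ch));
    }
    curl_close($ch);
    return $response;
}
```

A call would look like fetchAndSubmit('https://example.com/form-page', ['username' => 'my_username', 'password' => 'my_password']). If the action read in step 2 still stops at /it/it/page/, the string is probably added client-side, and a tool that executes JavaScript would be needed instead.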
Upvotes: 0
Views: 55