user3010273

Reputation: 900

Diagnosing bottlenecks when fetching data from API

I am running a dedicated server that fetches data from an API server. My machine runs on a Windows Server 2008 OS.

I use PHP's cURL functions to fetch the data via HTTP requests (through a proxy). The function I've created for that:

function get_http($url)
{
    // Read the proxy list and split it into one proxy per line
    $proxy_file = file_get_contents("proxylist.txt");
    $proxy_file = explode("\n", $proxy_file);

    $how_Many_Proxies = count($proxy_file);

    // rand() is inclusive on both ends, so the upper bound must be count - 1
    $which_Proxy = rand(0, $how_Many_Proxies - 1);

    $proxy = $proxy_file[$which_Proxy];

    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_PROXY, $proxy);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_HEADER, 0);
    $curl_scraped_page = curl_exec($ch);
    curl_close($ch);

    return $curl_scraped_page;
}

I then save the result in a MySQL database using this simple code. I run 20, 40, 60, even 100 copies of it in parallel; beyond a certain number, adding more doesn't increase performance, and I wonder where the bottleneck is:

function retrieveData($id)
{
    $the_data = get_http("http://api-service-ip-address/?id=$id");

    return $the_data;
}

$ids_List = file_get_contents("the-list.txt");
$ids_List = explode("\n", $ids_List);

for ($a = 0; $a < 50; $a++)
{
    // retrieveData() builds the full API URL from the id
    $array[$a] = retrieveData($ids_List[$a]);
}


for ($b = 0; $b < 50; $b++)
{
    $insert_Array[] = "('$ids_List[$b]', NULL, '$array[$b]')";
}

$insert_Array = implode(',', $insert_Array);

$sql = "INSERT INTO `the_data` (`id`, `queue_id`, `data`) VALUES $insert_Array;";

mysql_query($sql);
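Instead of launching many separate script instances, the same batch can be fetched concurrently inside one PHP process with curl_multi. This is only a sketch under the assumptions of the post (the URL pattern and the `fetch_batch` name are illustrative, not from the original code):

```php
<?php
// Fetch a batch of ids concurrently with curl_multi.
function fetch_batch(array $ids)
{
    $mh = curl_multi_init();
    $handles = array();

    foreach ($ids as $id) {
        $ch = curl_init("http://api-service-ip-address/?id=$id");
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
        curl_setopt($ch, CURLOPT_TIMEOUT, 10); // don't let one request hang the batch
        curl_multi_add_handle($mh, $ch);
        $handles[$id] = $ch;
    }

    // Run all transfers until none are still active.
    do {
        curl_multi_exec($mh, $running);
        curl_multi_select($mh); // wait for activity instead of busy-looping
    } while ($running > 0);

    // Collect results and release the handles.
    $results = array();
    foreach ($handles as $id => $ch) {
        $results[$id] = curl_multi_getcontent($ch);
        curl_multi_remove_handle($mh, $ch);
        curl_close($ch);
    }
    curl_multi_close($mh);

    return $results;
}
```

With this approach one process keeps many requests in flight at once, so request latency overlaps instead of adding up, and the per-process overhead of running 100 copies of the script disappears.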

After many optimizations, I am stuck at retrieving/fetching/saving around 23 rows of data per second.

The MySQL table is pretty simple and looks like this:

id | queue_id(AI) | data

Keep in mind that the database doesn't seem to be the bottleneck: when I check CPU usage, the mysql.exe process barely ever goes over 1%.

I fetch the data via 125 proxies. I decreased the number to 20 for a test and it DIDN'T make any difference (suggesting that the proxies are not the bottleneck, since I get the same performance with five times fewer of them?).

So if MySQL and the proxies are not the cause of the limit, what else could it be, and how can I find out?
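One way to narrow this down is to time the individual phases of a single request with curl_getinfo(). A sketch (the helper name is illustrative, not from the original code):

```php
<?php
// Time each phase of one request to locate the bottleneck.
function get_http_timed($url, $proxy)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_PROXY, $proxy);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    $page = curl_exec($ch);

    // curl_getinfo() breaks the request down into phases, in seconds.
    $info = curl_getinfo($ch);
    printf(
        "dns: %.3f  connect: %.3f  first byte: %.3f  total: %.3f\n",
        $info['namelookup_time'],
        $info['connect_time'],       // includes connecting to the proxy
        $info['starttransfer_time'], // time until the first byte arrives
        $info['total_time']
    );
    curl_close($ch);

    return $page;
}
```

If connect_time dominates, the proxies are slow to reach; if starttransfer_time dominates, the API server itself is the limit; if total_time is small but throughput is still low, the limit is on your side (process startup, file reads, or the database writes).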

So far, the optimizations I've made:

My questions:

Upvotes: 0

Views: 421

Answers (1)

Sabuj Hassan

Reputation: 39385

  1. You are reading the proxy file every time you call the curl function. I recommend moving the read outside the function: read the proxies once, store them in an array, and reuse it.

  2. Use the CURLOPT_TIMEOUT cURL option to set a fixed time limit on each request (for example, 3 seconds). That will help you determine whether the curl operation itself is the problem.
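Both suggestions together look roughly like this, applied to the `get_http` function from the question (passing the proxy list in as a parameter is an assumption; a global or class property would work too):

```php
<?php
// Load the proxy list ONCE at startup, not on every request.
$proxies = explode("\n", trim(file_get_contents("proxylist.txt")));

function get_http($url, array $proxies)
{
    // array_rand() always returns a valid key, avoiding off-by-one errors.
    $proxy = $proxies[array_rand($proxies)];

    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_PROXY, $proxy);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_TIMEOUT, 3); // give up after 3 seconds
    $curl_scraped_page = curl_exec($ch);
    curl_close($ch);

    return $curl_scraped_page;
}
```

With the timeout in place, a request that takes longer than 3 seconds returns false instead of stalling the worker, which makes slow proxies or a slow API immediately visible in the logs.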

Upvotes: 1
