Reputation: 900
I am running a dedicated server that fetches data from an API server. My machine runs on a Windows Server 2008 OS.
I use PHP's curl functions to fetch the data via HTTP requests (through proxies). The function I've created for that:
function get_http($url)
{
$proxy_file = file_get_contents("proxylist.txt");
$proxy_file = explode("
", $proxy_file);
$how_Many_Proxies = count($proxy_file);
$which_Proxy = rand(0, $how_Many_Proxies - 1); // -1 so the random index stays inside the array
$proxy = $proxy_file[$which_Proxy];
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL,$url);
curl_setopt($ch, CURLOPT_PROXY, $proxy);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_HEADER, 0);
$curl_scraped_page = curl_exec($ch);
curl_close($ch);
return $curl_scraped_page;
}
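Out of curiosity, here is a sketch (not part of my current code) of how I could ask cURL itself where each request spends its time, using curl_getinfo with the same options as above:

function get_http_timed($url, $proxy)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_PROXY, $proxy);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_HEADER, 0);
    $page = curl_exec($ch);

    // cURL's own breakdown of where the time went for this request
    $timing = array(
        'total'         => curl_getinfo($ch, CURLINFO_TOTAL_TIME),         // whole transfer
        'connect'       => curl_getinfo($ch, CURLINFO_CONNECT_TIME),       // TCP connect (to the proxy)
        'starttransfer' => curl_getinfo($ch, CURLINFO_STARTTRANSFER_TIME), // time until the first byte arrives
    );
    curl_close($ch);
    return array($page, $timing);
}

If 'connect' or 'starttransfer' dominates 'total', the time is going to the proxy/API round trip rather than to my own code.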
I then save it in the MySQL database using this simple code, of which I run 20-40-60-100 copies in parallel with curl (past a certain number, adding more doesn't increase performance, and I wonder where the bottleneck is; a curl_multi experiment I'm considering is sketched after the code):
function retrieveData($id)
{
$the_data = get_http("http://api-service-ip-address/?id=$id");
return $the_data;
}
$ids_List = file_get_contents("the-list.txt");
$ids_List = explode("
",$ids_List);
for($a = 0;$a<50;$a++)
{
$array[$a] = retrieveData($ids_List[$a]);
}
for($b = 0;$b<50;$b++)
{
$insert_Array[] = "('$ids_List[$b]', NULL, '$array[$b]')";
}
$insert_Array = implode(',', $insert_Array);
$sql = "INSERT INTO `the_data` (`id`, `queue_id`, `data`) VALUES $insert_Array;";
mysql_query($sql);
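The curl_multi experiment mentioned above would fetch the 50 URLs concurrently inside a single process instead of launching many copies of the script. This is only a sketch of what I have in mind ($proxy_List is hypothetical and would hold the proxies, loaded once):

$mh = curl_multi_init();
$handles = array();
for ($a = 0; $a < 50; $a++)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, "http://api-service-ip-address/?id=" . $ids_List[$a]);
    curl_setopt($ch, CURLOPT_PROXY, $proxy_List[array_rand($proxy_List)]); // hypothetical pre-loaded proxy array
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_HEADER, 0);
    curl_multi_add_handle($mh, $ch);
    $handles[$a] = $ch;
}

// run all 50 transfers at once
$running = null;
do {
    curl_multi_exec($mh, $running);
    curl_multi_select($mh);
} while ($running > 0);

for ($a = 0; $a < 50; $a++)
{
    $array[$a] = curl_multi_getcontent($handles[$a]);
    curl_multi_remove_handle($mh, $handles[$a]);
    curl_close($handles[$a]);
}
curl_multi_close($mh);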
After many optimizations, I am stuck at retrieving/fetching/saving around 23 rows per second.
The MySQL table is pretty simple: just the id, queue_id and data columns used in the INSERT above.
Keep in mind that the database doesn't seem to be the bottleneck: when I check CPU usage, the mysql.exe process barely ever goes over 1%.
I fetch the data via 125 proxies. I decreased the amount to 20 for a test and it DIDN'T make any difference (suggesting that the proxies are not the bottleneck, since I get the same performance with five times fewer of them?).
So if MySQL and the proxies are not the cause of the limit, what else can it be, and how can I find out?
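One rough way I could check (again, just a sketch, not something I've run yet) is to time the fetch phase and the insert phase separately with microtime:

$t0 = microtime(true);

// fetch loop from above
for ($a = 0; $a < 50; $a++)
{
    $array[$a] = retrieveData($ids_List[$a]);
}

$t1 = microtime(true);

// ... build $insert_Array and run the INSERT as above ...

$t2 = microtime(true);

echo "fetching took " . ($t1 - $t0) . " s, inserting took " . ($t2 - $t1) . " s\n";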
The optimizations I've done so far:
replaced file_get_contents with curl functions for retrieving the http data
replaced the https:// URL with an http:// one (is this faster?)
indexed the table
replaced the API domain name with a plain IP address (so DNS lookup time isn't a factor)
I use only private proxies that have low latency.
My questions:
What may be the possible cause of the performance limit?
How do I find the reason for the limit?
Can this be caused by some TCP/IP limitation or a poorly configured Apache/Windows?
The API is really fast and serves many times more queries to other people, so I don't believe the API itself is what's limiting the speed.
Upvotes: 0
Views: 421
Reputation: 39385
You are reading the proxy file every time you call the curl function. I recommend moving that read outside the function: read the proxies once and store them in an array so you can reuse them.
Also use the curl option CURLOPT_TIMEOUT to set a fixed limit for each curl execution (for example 3 seconds). It will help you debug whether the curl operation itself is the issue or not.
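Roughly what I mean, as a sketch based on your function (the proxy list is passed in as a hypothetical second argument instead of being read from disk on every call):

// read the proxy list once, outside the function
$proxies = explode("\n", file_get_contents("proxylist.txt"));

function get_http($url, $proxies)
{
    $proxy = $proxies[array_rand($proxies)]; // pick a random proxy from the pre-loaded array
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_PROXY, $proxy);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_HEADER, 0);
    curl_setopt($ch, CURLOPT_TIMEOUT, 3); // abort any request that takes longer than 3 seconds
    $page = curl_exec($ch);
    curl_close($ch);
    return $page;
}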
Upvotes: 1