Reputation: 95
I have a problem working with very large files. I tried to solve it by splitting the file into several parts, but the problem remains because each part is still large. For example:
A is a 1 GB file.
I split it into
A_1 (200 MB), A_2 (200 MB), and so on.
With the single file (A), my logic is:
for ( ... )
{
    $data = file_get_contents("data/A.vcf");
    // very complex code (including parsing) that works on the contents of $data;
    // because of the loop, file_get_contents is called many times
}
Then I changed the logic to pick one of the parts based on the position value, i.e.:
for ( ... )
{
    // switch (true) runs the first case whose condition is true
    switch (true)
    {
        case ($position >= 0 && $position < 5000000):
            $data = file_get_contents("data/A_1.vcf");
            break;
        case ($position >= 5000000 && $position < 10000000):
            $data = file_get_contents("data/A_2.vcf");
            break;
        case ($position >= 10000000 && $position < 20000000):
            $data = file_get_contents("data/A_3.vcf");
            break;
        case ($position >= 20000000 && $position < 30000000):
            $data = file_get_contents("data/A_4.vcf");
            break;
        ...
    }
    // very complex code (including parsing) that works on the contents of $data;
    // because of the loop, file_get_contents is called many times
}
But the problem remains, because each part is still large. When I trimmed the data down to 200 KB the problem went away, but that is not what I want, because the data is then incomplete. Is there another way to solve this? Is file_get_contents the reason it fails? Is there another way to retrieve values from a very large file?
[UPDATE]
<?php
/*
Sample records taken at random from several large files for testing
50001374 rs389045667 T C
10000685 rs123308931 A C
39769437 rs393441165 C T
26907032 rs393470108 C T
50001195 rs122244329 G T
*/
$posi = array(50001374, 10000685, 39769437, 26907032, 50001195);
$id = array(".", ".", ".", ".", ".");
$ref = array("T", "A", "C", "C", "G");
$alt = array("C", "C", "T", "T", "T");
for ($i = 0; $i < 5; $i++)
{
    // switch (true) runs the first case whose condition is true
    switch (true)
    {
        case ($posi[$i] >= 0 && $posi[$i] < 5000000):
            $data = file_get_contents("data/ncbi/5.vcf");
            break;
        case ($posi[$i] >= 5000000 && $posi[$i] < 10000000):
            $data = file_get_contents("data/ncbi/10.vcf");
            break;
        case ($posi[$i] >= 10000000 && $posi[$i] < 20000000):
            $data = file_get_contents("data/ncbi/20.vcf");
            break;
        case ($posi[$i] >= 20000000 && $posi[$i] < 30000000):
            $data = file_get_contents("data/ncbi/30.vcf");
            break;
        case ($posi[$i] >= 30000000 && $posi[$i] < 40000000):
            $data = file_get_contents("data/ncbi/40.vcf");
            break;
        case ($posi[$i] >= 40000000 && $posi[$i] < 50000000):
            $data = file_get_contents("data/ncbi/50.vcf");
            break;
        case ($posi[$i] >= 50000000):
            $data = file_get_contents("data/ncbi/60.vcf");
            break;
    }
    // split the whole file into lines, then each line into tab-separated columns
    $data = explode("\n", $data);
    $data2 = array();
    foreach ($data as $dat)
    {
        $data2[] = explode("\t", $dat);
    }
    for ($j = 0; $j < count($data2); $j++)
    {
        // columns: 1 = position, 3 = reference allele, 4 = alternate allele
        if ($data2[$j][1] == $posi[$i] && $data2[$j][3] == $ref[$i] && $data2[$j][4] == $alt[$i])
        {
            echo '<pre>';
            print_r($posi[$i] . "\n");
            print_r($id[$i] . "\n");
            print_r($ref[$i] . "\n");
            print_r($alt[$i] . "\n");
            echo '</pre>';
            break;
        }
    }
}
?>
Explanation:
The position data is already sorted. In the code, when "if($data2[$j][1] == $posi[$i] && $data2[$j][3] == $ref[$i] && $data2[$j][4] == $alt[$i])" is true, I want the file to be released and the "for $j" loop to exit. Execution should then return to the start of the "for $i" loop, select the next file (switch), evaluate the same condition again, and so on.
So I do not want to read the whole file; I only want to read it until the position is found.
But I do not know how to free the file. The code above always fails on files that are too large.
Upvotes: 1
Views: 129
Reputation: 2707
Read the file line by line instead of loading it all at once. You can do the same with a single file, even a 1 GB one (it will just take longer):
<?php
/*
Sample records taken at random from several large files for testing
50001374 rs389045667 T C
10000685 rs123308931 A C
39769437 rs393441165 C T
26907032 rs393470108 C T
50001195 rs122244329 G T
*/
$posi = array(50001374, 10000685, 39769437, 26907032, 50001195);
$id = array(".", ".", ".", ".", ".");
$ref = array("T", "A", "C", "C", "G");
$alt = array("C", "C", "T", "T", "T");
for ($i = 0; $i < 5; $i++)
{
    // switch (true) runs the first case whose condition is true
    switch (true)
    {
        case ($posi[$i] >= 0 && $posi[$i] < 5000000):
            $file = "data/ncbi/5.vcf";
            break;
        case ($posi[$i] >= 5000000 && $posi[$i] < 10000000):
            $file = "data/ncbi/10.vcf";
            break;
        case ($posi[$i] >= 10000000 && $posi[$i] < 20000000):
            $file = "data/ncbi/20.vcf";
            break;
        case ($posi[$i] >= 20000000 && $posi[$i] < 30000000):
            $file = "data/ncbi/30.vcf";
            break;
        case ($posi[$i] >= 30000000 && $posi[$i] < 40000000):
            $file = "data/ncbi/40.vcf";
            break;
        case ($posi[$i] >= 40000000 && $posi[$i] < 50000000):
            $file = "data/ncbi/50.vcf";
            break;
        case ($posi[$i] >= 50000000):
            $file = "data/ncbi/60.vcf";
            break;
    }
    $handle = fopen($file, "r");
    if ($handle) {
        $found = false;
        // fgets() without a length argument returns the whole line, so long lines are not split
        while (($line = fgets($handle)) !== false) {
            $line = explode("\t", rtrim($line, "\r\n"));
            // skip header or short lines that do not have all five columns
            if (count($line) < 5) {
                continue;
            }
            if ($line[1] == $posi[$i] && $line[3] == $ref[$i] && $line[4] == $alt[$i]) {
                echo '<pre>';
                print_r($posi[$i] . "\n");
                print_r($id[$i] . "\n");
                print_r($ref[$i] . "\n");
                print_r($alt[$i] . "\n");
                echo '</pre>';
                $found = true;
                break; // stop reading as soon as the record is found
            }
        }
        // report a read error only if the loop ended without a match and before end of file
        if (!$found && !feof($handle)) {
            echo "Error: unexpected fgets() fail\n";
        }
        fclose($handle);
    }
}
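If you go the single-file route, you may also want to avoid reopening and rescanning the file for every record. A minimal sketch of that idea, assuming the same column layout as above (position in column 1, reference allele in column 3, alternate allele in column 4): it reads the file once, checks each line against a lookup table, and stops as soon as every wanted record has been seen. The names findVariants and buildWanted and the "pos|ref|alt" key format are made up for this illustration.

```php
<?php
// Build a lookup table keyed by "pos|ref|alt" (hypothetical key format).
function buildWanted(array $posi, array $ref, array $alt): array
{
    $wanted = [];
    foreach ($posi as $i => $pos) {
        $wanted[$pos . '|' . $ref[$i] . '|' . $alt[$i]] = true;
    }
    return $wanted;
}

// Scan the file once, line by line, and return the wanted records that were found.
// Only one line is held in memory at a time.
function findVariants(string $path, array $wanted): array
{
    $found = [];
    $handle = fopen($path, 'r');
    if ($handle === false) {
        return $found;
    }
    while (($line = fgets($handle)) !== false) {
        if ($line[0] === '#') {
            continue; // header line
        }
        $cols = explode("\t", rtrim($line, "\r\n"));
        if (count($cols) < 5) {
            continue; // blank or malformed line
        }
        $key = $cols[1] . '|' . $cols[3] . '|' . $cols[4];
        if (isset($wanted[$key])) {
            $found[$key] = $cols;
            if (count($found) === count($wanted)) {
                break; // every wanted record seen, stop reading
            }
        }
    }
    fclose($handle);
    return $found;
}
```

With the arrays from the question you would call findVariants("data/A.vcf", buildWanted($posi, $ref, $alt)) and loop over the result. The isset() lookup keeps the per-line cost constant no matter how many records you search for, so the whole run is a single pass over the file.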
Upvotes: 1