Reputation: 693
Note: I'll start off by saying that I know I'm probably missing something really obvious. I'm in one of those coding hazes where I can't see the simple solution.
Issue: I've written a script in PHP to parse a .csv file, select the column containing e-mail addresses, and put them into a database. Now, I've found that the users are attempting to upload files that have a .csv filetype, but are not actually comma-separated. I'm trying to write a function that will properly determine the delimiter (tab, new line, space, etc.), but am having some trouble with it. I think I would like to get an array of all of these addresses so that the number of keys would add credence to that delimiter.
The code:
$filename = "../some/path/test.csv";
if (($handle = fopen($fileName, "r")) !== FALSE) {
$delimiters = array(',', ' ', "\t", "\n");
$delimNum = 0;
foreach ($delimiters as $delimiter) {
$row = 0;
while (($data = fgetcsv($handle, 1000, $delimiter)) !== FALSE) {
$data = (string)$data[0];
$delimiterList[$delimNum] = explode($delimiter, $data);
$row++;
}
$delimNum++;
}
die(print_r($delimiterList));
}
The result:
Array
(
[0] => Array
(
[0] => email
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
)
)
Like I said, I know this is probably the wrong way to approach this, so I'm thankful for any insight you can provide!
Upvotes: 3
Views: 7262
Reputation: 1560
I will show an algorithm that might be a pretty good solution, don't think of this problem as an easy one, this is like guessing, so this problem will not have a perfect solution.
One should instead try to aproximate a 99% good solution using statistics or some other heuristics. I'm a computer scientist, also a developer, but this is the aproximation that a machine learning or data scientist would give.
Here it is:
Seems complicated but is a pretty good and not difficult algorithm. Below there is one example of the calculations:
Delimiters counts (histogram)
| | ; | , | \t |
|---------|---|---|-----|
| LINE 1 | 3 | 1 | 13 |
| LINE 2 | 2 | 1 | 0 |
| LINE 3 | 3 | 1 | 0 |
| LINE 4 | 3 | 1 | 124 |
| LINE 5 | 2 | 1 | 2 |
| LINE 6 | 2 | 1 | 2 |
| LINE 7 | 3 | 1 | 12 |
| LINE 8 | 3 | 1 | 0 |
| LINE 9 | 3 | 1 | 0 |
| LINE 10 | 3 | 1 | 0 |
Calculations and final score
| | ; | , | \t | | WEIGHTS | ; | , | \t |
|------------|------|------|------|--|---------|------|------|----|
| AVERAGE | 2,7 | 1 | 15,3 | | | | | |
| NORMALIZED | 0,17 | 0,06 | 1 | | 1 | 0,17 | 0,06 | 1 |
| VARIANCE | 0,21 | 0 | 1335 | | | | | |
| NORMALIZED | 0,99 | 1 | 0 | | 3 | 2,99 | 3 | 0 |
| | | | | | SCORE | 3,17 | 3,06 | 1 |
As you can see the delimiter ';' has the better score. I think is also good to weight more the variance than the average of delimiters found. It is more likely to have a file where delimiters don't vary much in each line.
Upvotes: 2
Reputation: 77
SplFileObject::getCsvControl
in the manual
I didn't find it till too late, so wrote a function which works well. In case it is useful/of interest my approach was:
I used $handle
and $ColName
parameters with $ColName
optional
$ColName lets you check which delimiter finds an expected header column name in the first record, if the csv file has a header line.
If no header line, or you don't know the column names, it resorts to a default check: which delimiter finds most fields for the same record (this will usually be the right one). I then also check that that delimiter returns the same number of fields for each of next few lines.
fgetcsv appears to work in blocks and to force each record to have the same number of fields as the max in the block, so this would work even with varying numbers of fields per record
Upvotes: 1
Reputation: 1
This is my solution. Its works if you know how many columns you expect. Finally, the separator character is the $actual_separation_character
$separator_1=",";
$separator_2=";";
$separator_3="\t";
$separator_4=":";
$separator_5="|";
$separator_1_number=0;
$separator_2_number=0;
$separator_3_number=0;
$separator_4_number=0;
$separator_5_number=0;
/* YOU NEED TO CHANGE THIS VARIABLE */
// Expected number of separation character ( 3 colums ==> 2 sepearation caharacter / row )
$expected_separation_character_number=2;
$file = fopen("upload/filename.csv","r");
while(! feof($file)) //read file rows
{
$row= fgets($file);
$row_1_replace=str_replace($separator_1,"",$row);
$row_1_length=strlen($row)-strlen($row_1_replace);
if(($row_1_length==$expected_separation_character_number)or($expected_separation_character_number==0)){
$separator_1_number=$separator_1_number+$row_1_length;
}
$row_2_replace=str_replace($separator_2,"",$row);
$row_2_length=strlen($row)-strlen($row_2_replace);
if(($row_2_length==$expected_separation_character_number)or($expected_separation_character_number==0)){
$separator_2_number=$separator_2_number+$row_2_length;
}
$row_3_replace=str_replace($separator_3,"",$row);
$row_3_length=strlen($row)-strlen($row_3_replace);
if(($row_3_length==$expected_separation_character_number)or($expected_separation_character_number==0)){
$separator_3_number=$separator_3_number+$row_3_length;
}
$row_4_replace=str_replace($separator_4,"",$row);
$row_4_length=strlen($row)-strlen($row_4_replace);
if(($row_4_length==$expected_separation_character_number)or($expected_separation_character_number==0)){
$separator_4_number=$separator_4_number+$row_4_length;
}
$row_5_replace=str_replace($separator_5,"",$row);
$row_5_length=strlen($row)-strlen($row_5_replace);
if(($row_5_length==$expected_separation_character_number)or($expected_separation_character_number==0)){
$separator_5_number=$separator_5_number+$row_5_length;
}
} // while(! feof($file)) END
fclose($file);
/* THE FILE ACTUAL SEPARATOR (delimiter) CHARACTER */
/* $actual_separation_character */
if ($separator_1_number==max($separator_1_number,$separator_2_number,$separator_3_number,$separator_4_number,$separator_5_number)){$actual_separation_character=$separator_1;}
else if ($separator_2_number==max($separator_1_number,$separator_2_number,$separator_3_number,$separator_4_number,$separator_5_number)){$actual_separation_character=$separator_2;}
else if ($separator_3_number==max($separator_1_number,$separator_2_number,$separator_3_number,$separator_4_number,$separator_5_number)){$actual_separation_character=$separator_3;}
else if ($separator_4_number==max($separator_1_number,$separator_2_number,$separator_3_number,$separator_4_number,$separator_5_number)){$actual_separation_character=$separator_4;}
else if ($separator_5_number==max($separator_1_number,$separator_2_number,$separator_3_number,$separator_4_number,$separator_5_number)){$actual_separation_character=$separator_5;}
else {$actual_separation_character=";";}
/*
if the number of columns more than what you expect, do something ...
*/
if ($expected_separation_character_number>0){
if ($separator_1_number==0 and $separator_2_number==0 and $separator_3_number==0 and $separator_4_number==0 and $separator_5_number==0){/* do something ! more columns than expected ! */}
}
Upvotes: 0
Reputation: 40582
Solve this problem with usability instead of code. Have the user pick the delimiter.
However, since they may not know what tab delimited, CSV, et al mean, just show them a preview. They can pick from the options till the output looks correct and tabular.
Then you parse it according to the format selected.
Upvotes: 3
Reputation: 1177
This isn't a perfect solution, but it MIGHT help you - if you can't ask what the delimiter is.
Instead of trying to parse as CSV anymore, try just retrieving valid email addresses. I don't think space, comma, tab or new-line is a valid email part right? (Who knows ;) Check out this discussion on using regular expressions to validate email - so you can see some of the pitfalls of this solution.
But, then I would write the regular expression using preg_match_all() and retrieve a list of all strings in a valid email format.
Good luck!
Upvotes: 1