JamieHoward
JamieHoward

Reputation: 693

Determine .csv delimiter in PHP

Note: I'll start off by saying that I know I'm probably missing something really obvious. I'm in one of those coding hazes where I can't see the simple solution.

Issue: I've written a script in PHP to parse a .csv file, select the column containing e-mail addresses, and put them into a database. Now, I've found that the users are attempting to upload files that have a .csv filetype, but are not actually comma-separated. I'm trying to write a function that will properly determine the delimiter (tab, new line, space, etc.), but am having some trouble with it. I think I would like to get an array of all of these addresses so that the number of keys would add credence to that delimiter.

The code:

$filename = "../some/path/test.csv";   
if (($handle = fopen($fileName, "r")) !== FALSE) {
    $delimiters = array(',', ' ', "\t", "\n");
    $delimNum = 0;
    foreach ($delimiters as $delimiter) {
      $row = 0;
      while (($data = fgetcsv($handle, 1000, $delimiter)) !== FALSE) {
        $data = (string)$data[0];
        $delimiterList[$delimNum] = explode($delimiter, $data);
        $row++;
    }
    $delimNum++;
}
die(print_r($delimiterList));
}

The result:

Array
(
[0] => Array
    (
        [0] => email
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
    )
)

Like I said, I know this is probably the wrong way to approach this, so I'm thankful for any insight you can provide!

Upvotes: 3

Views: 7262

Answers (5)

David Lemon
David Lemon

Reputation: 1560

I will show an algorithm that might be a pretty good solution, don't think of this problem as an easy one, this is like guessing, so this problem will not have a perfect solution.

One should instead try to aproximate a 99% good solution using statistics or some other heuristics. I'm a computer scientist, also a developer, but this is the aproximation that a machine learning or data scientist would give.

Here it is:

  1. Pick a number of random lines of all the lines of the file, say 10
  2. Count the number of occurrences of each delimiter candidate
  3. With this number calculate the average and the variance of each delimiter.
  4. Normalize the numbers, this means give numbers between 0 and 1, using your custom linear function
  5. Give weigths to the two values for each delimiters and sum, this gives a score of each delimiters that you can use as a decision

Seems complicated but is a pretty good and not difficult algorithm. Below there is one example of the calculations:

Delimiters counts (histogram)

|         | ; | , | \t  |
|---------|---|---|-----|
| LINE 1  | 3 | 1 |  13 |
| LINE 2  | 2 | 1 |   0 |
| LINE 3  | 3 | 1 |   0 |
| LINE 4  | 3 | 1 | 124 |
| LINE 5  | 2 | 1 |   2 |
| LINE 6  | 2 | 1 |   2 |
| LINE 7  | 3 | 1 |  12 |
| LINE 8  | 3 | 1 |   0 |
| LINE 9  | 3 | 1 |   0 |
| LINE 10 | 3 | 1 |   0 |

Calculations and final score

|            |  ;   |  ,   |  \t  |  | WEIGHTS |  ;   |  ,   | \t |
|------------|------|------|------|--|---------|------|------|----|
| AVERAGE    |  2,7 |    1 | 15,3 |  |         |      |      |    |
| NORMALIZED | 0,17 | 0,06 |    1 |  | 1       | 0,17 | 0,06 |  1 |
| VARIANCE   | 0,21 |    0 | 1335 |  |         |      |      |    |
| NORMALIZED | 0,99 |    1 |    0 |  | 3       | 2,99 |    3 |  0 |
|            |      |      |      |  | SCORE   | 3,17 | 3,06 |  1 |

As you can see the delimiter ';' has the better score. I think is also good to weight more the variance than the average of delimiters found. It is more likely to have a file where delimiters don't vary much in each line.

Upvotes: 2

user3600150
user3600150

Reputation: 77

SplFileObject::getCsvControl in the manual

I didn't find it till too late, so wrote a function which works well. In case it is useful/of interest my approach was:

I used $handle and $ColName parameters with $ColName optional

$ColName lets you check which delimiter finds an expected header column name in the first record, if the csv file has a header line.

If no header line, or you don't know the column names, it resorts to a default check: which delimiter finds most fields for the same record (this will usually be the right one). I then also check that that delimiter returns the same number of fields for each of next few lines.

fgetcsv appears to work in blocks and to force each record to have the same number of fields as the max in the block, so this would work even with varying numbers of fields per record

Upvotes: 1

user3917103
user3917103

Reputation: 1

This is my solution. Its works if you know how many columns you expect. Finally, the separator character is the $actual_separation_character

$separator_1=",";
$separator_2=";";
$separator_3="\t";
$separator_4=":";
$separator_5="|";

$separator_1_number=0;
$separator_2_number=0;
$separator_3_number=0;
$separator_4_number=0;
$separator_5_number=0;

/* YOU NEED TO CHANGE THIS VARIABLE */
// Expected number of separation character ( 3 colums ==> 2 sepearation caharacter / row )
$expected_separation_character_number=2;  


$file = fopen("upload/filename.csv","r");
while(! feof($file)) //read file rows
{
    $row= fgets($file);

    $row_1_replace=str_replace($separator_1,"",$row);
    $row_1_length=strlen($row)-strlen($row_1_replace);

    if(($row_1_length==$expected_separation_character_number)or($expected_separation_character_number==0)){
    $separator_1_number=$separator_1_number+$row_1_length;
    }

    $row_2_replace=str_replace($separator_2,"",$row);
    $row_2_length=strlen($row)-strlen($row_2_replace);

    if(($row_2_length==$expected_separation_character_number)or($expected_separation_character_number==0)){
    $separator_2_number=$separator_2_number+$row_2_length;
    }

    $row_3_replace=str_replace($separator_3,"",$row);
    $row_3_length=strlen($row)-strlen($row_3_replace);

    if(($row_3_length==$expected_separation_character_number)or($expected_separation_character_number==0)){
    $separator_3_number=$separator_3_number+$row_3_length;
    }

    $row_4_replace=str_replace($separator_4,"",$row);
    $row_4_length=strlen($row)-strlen($row_4_replace);

    if(($row_4_length==$expected_separation_character_number)or($expected_separation_character_number==0)){
    $separator_4_number=$separator_4_number+$row_4_length;
    }

    $row_5_replace=str_replace($separator_5,"",$row);
    $row_5_length=strlen($row)-strlen($row_5_replace);

    if(($row_5_length==$expected_separation_character_number)or($expected_separation_character_number==0)){
    $separator_5_number=$separator_5_number+$row_5_length;
    }

} // while(! feof($file))  END
fclose($file);

/* THE FILE ACTUAL SEPARATOR (delimiter) CHARACTER */
/* $actual_separation_character */

if ($separator_1_number==max($separator_1_number,$separator_2_number,$separator_3_number,$separator_4_number,$separator_5_number)){$actual_separation_character=$separator_1;}
else if ($separator_2_number==max($separator_1_number,$separator_2_number,$separator_3_number,$separator_4_number,$separator_5_number)){$actual_separation_character=$separator_2;}
else if ($separator_3_number==max($separator_1_number,$separator_2_number,$separator_3_number,$separator_4_number,$separator_5_number)){$actual_separation_character=$separator_3;}
else if ($separator_4_number==max($separator_1_number,$separator_2_number,$separator_3_number,$separator_4_number,$separator_5_number)){$actual_separation_character=$separator_4;}
else if ($separator_5_number==max($separator_1_number,$separator_2_number,$separator_3_number,$separator_4_number,$separator_5_number)){$actual_separation_character=$separator_5;}
else {$actual_separation_character=";";}

/* 
if the number of columns more than what you expect, do something ...
*/

if ($expected_separation_character_number>0){
if ($separator_1_number==0 and $separator_2_number==0 and $separator_3_number==0 and $separator_4_number==0 and $separator_5_number==0){/* do something ! more columns than expected ! */}
}

Upvotes: 0

Kato
Kato

Reputation: 40582

Solve this problem with usability instead of code. Have the user pick the delimiter.

However, since they may not know what tab delimited, CSV, et al mean, just show them a preview. They can pick from the options till the output looks correct and tabular.

Then you parse it according to the format selected.

Upvotes: 3

Aaron Saray
Aaron Saray

Reputation: 1177

This isn't a perfect solution, but it MIGHT help you - if you can't ask what the delimiter is.

Instead of trying to parse as CSV anymore, try just retrieving valid email addresses. I don't think space, comma, tab or new-line is a valid email part right? (Who knows ;) Check out this discussion on using regular expressions to validate email - so you can see some of the pitfalls of this solution.

But, then I would write the regular expression using preg_match_all() and retrieve a list of all strings in a valid email format.

Good luck!

Upvotes: 1

Related Questions