Syed

Reputation: 931

Converting comma separated list to an array - explode vs preg_split

I have a comma-separated list of city names ($cityNames may contain 100 to 500 names).

$cityNames = "Chicago, San Diego, El Paso";

Which of the following is better for converting the comma-separated list to an array, keeping in mind performance and accuracy?

$cityNamesArray = explode(",", $cityNames);

or

$cityNamesArray = preg_split('/\s*,\s*/', $cityNames, -1, PREG_SPLIT_NO_EMPTY);

Note: the comma-separated list is provided by the user via a textarea.

Upvotes: 0

Views: 232

Answers (2)

Rasa Mohamed

Reputation: 892

For simple usage explode() is faster, see: http://micro-optimization.com/explode-vs-preg_split

But preg_split() has the advantage of handling tabs (\t) and spaces via \s.

The \s metacharacter matches any whitespace character.

A whitespace character can be (http://php.net/manual/en/regexp.reference.escape.php):

  • space character (32 = 0x20)

  • tab character (9 = 0x09)

  • carriage return character (13 = 0x0D)

  • new line character (10 = 0x0A)

  • form feed character (12 = 0x0C)
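To illustrate the difference, here is a minimal sketch (the input string is made up for this example, not taken from the question): explode() keeps the surrounding whitespace attached to each element, while the \s*,\s* pattern consumes it as part of the delimiter.

$cityNames = "Chicago ,\tSan Diego,  El Paso"; // hypothetical input with a tab and extra spaces

// explode() splits on the literal comma only, so the whitespace stays in the elements
print_r(explode(',', $cityNames));
// [0] => "Chicago ", [1] => "<tab>San Diego", [2] => "  El Paso"

// preg_split() swallows the surrounding whitespace together with the comma
print_r(preg_split('/\s*,\s*/', $cityNames, -1, PREG_SPLIT_NO_EMPTY));
// [0] => "Chicago", [1] => "San Diego", [2] => "El Paso"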

In this case you should weigh the cost against the benefit.

A tip: use array_filter() to remove empty items from the array.

Example:

$keyword = explode(' ', $_GET['search']); // or preg_split()
print_r($keyword);

// empty() is a language construct, not a valid callback; passing 'strlen'
// (or no callback at all) drops the zero-length elements instead
$keyword = array_filter($keyword, 'strlen');
print_r($keyword);

Note: RegExp Performance

Upvotes: 0

Sherif

Reputation: 11942

I always like to point out that the correctness of a solution takes priority over how fast it runs. Something that doesn't work but is really fast is just as much of a problem as something that works but is really slow.

So I'll address both the correctness of the solution as well as its efficiency separately.

Correctness

A combination of explode() and trim(), applied via array_map(), works nicely to achieve your desired goal here.

$cityNamesArray = array_map('trim', explode(',', $cityNames ));

You can also throw in array_filter() here to make sure zero-length strings don't pass through. So in a string like "Chicago, San Diego, El Paso,, New York," you wouldn't get an array with some empty values.

$cityNamesArray = array_filter(array_map('trim', explode(',', $cityNames )), 'strlen');
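For instance (illustrative only), with a messy string like the one mentioned above you get just the four trimmed names. Note that array_filter() preserves keys, so wrap the result in array_values() if you need a re-indexed list.

$cityNames = "Chicago, San Diego, El Paso,, New York,";
$cityNamesArray = array_filter(array_map('trim', explode(',', $cityNames)), 'strlen');
print_r($cityNamesArray);
// Array ( [0] => Chicago [1] => San Diego [2] => El Paso [4] => New York )
// keys are preserved; array_values($cityNamesArray) gives a 0..3 indexed list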

This assumes the input data can be inconsistent and that breaking on such input would harm the desired end result, so the solution is built to withstand it.

The combination of function calls here causes the array to be iterated several times, so you have roughly O(2n + k) time, where k is the number of characters in the string scanned for delimiters and n is the number of elements in the resulting array passed through array_map() and array_filter().

Speed

Now, to make it faster, we need to get the big O down closer to O(k) for the most optimal solution, because you can't reduce k any further with a single-character needle/haystack substring search.

The preg_split('/\s*,\s*/', $cityNames, -1, PREG_SPLIT_NO_EMPTY) approach has about O(k) time complexity, because it's unlikely to be more than O(k + 1), or O(k + log k) in the worst case if the PCRE VM needs more than a single pass.

It also works correctly on the aforementioned case where $cityNames = "Chicago, San Diego, El Paso,, New York," or some similar result.
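A quick check (again just illustrative, using that same messy input) shows preg_split() returning a clean, sequentially indexed array in one call:

$cityNames = "Chicago, San Diego, El Paso,, New York,";
print_r(preg_split('/\s*,\s*/', $cityNames, -1, PREG_SPLIT_NO_EMPTY));
// Array ( [0] => Chicago [1] => San Diego [2] => El Paso [3] => New York )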

This means that it meets both the criteria for correctness and efficiency. Thus I would say it is the optimal solution.


Benchmarking

With that said, I think you'll find that the performance differences between the two approaches are fairly negligible.

Here's a rudimentary benchmark to demonstrate just how negligible the differences are for average input.

$cityNames = "Chicago, San Diego,El Paso,,New York,  ,"; // sample data

$T = 0; // total time spent

for($n = 0; $n < 10000; $n++) {
    $t = microtime(true); // start time
    preg_split('/\s*,\s*/', $cityNames, -1, PREG_SPLIT_NO_EMPTY);
    $t = microtime(true) - $t; // end time
    $T += $t; // aggregate time
}

printf("preg_split took %.06f seconds on average", $T / $n);


$T = 0; // total time spent

for($n = 0; $n < 10000; $n++) {
    $t = microtime(true); // start time
    array_filter(array_map('trim', explode(',', $cityNames )), 'strlen');
    $t = microtime(true) - $t; // end time
    $T += $t; // aggregate time
}

printf("array functions took %.06f seconds on average", $T / $n);
preg_split took 0.000003 seconds on average
array functions took 0.000005 seconds on average

This is an average difference of maybe 1 or 2 microseconds between them. When measuring such minute differences in speed, you really shouldn't care too much as long as the solution is correct. The better way to think about performance problems is in orders of magnitude. A solution that's 1 or 2 microseconds faster isn't worth pursuing if it costs more effort than just using the existing solution that's almost as fast but equally correct. However, a solution that works 1 or 2 orders of magnitude faster might be.

Upvotes: 2
