Daeto

Reputation: 480

Split string by spaces and between whole words or individual symbols

I am looking for a regular expression that splits a string on whitespace and also separates commas, the equal sign, and any other special characters I might need into their own elements.

This is what I have right now:

$content = preg_split('/[\s]+/', $file_content, -1, PREG_SPLIT_NO_EMPTY);

This stores the content of the input file in an array, with the elements split on whitespace.

However, for an input such as function a (int i) {}; the array looks like this:

[0] = function
[1] = a
[2] = (int
[3] = i)
[4] = {};

And what I'd like to achieve with the regular expression is this:

[0] = function
[1] = a
[2] = (
[3] = int
[4] = i
[5] = )
[6] = {
[7] = }
[8] = ;

Upvotes: 1

Views: 1022

Answers (3)

mickmackusa

Reputation: 47874

I recommend matching either a single non-letter or zero or more letters, using \K to reset the start of the match so that those characters are kept, and then splitting on the zero or more whitespace characters that follow.

$input = 'function a (int i) {};';

var_export(
    preg_split(
        '/(?:\PL|\pL*)\K\s*/u',
        $input,
        -1,
        PREG_SPLIT_NO_EMPTY
    )
);

Compare:

  • (?:\PL|\pL*)\K\s* 58 steps (with PREG_SPLIT_NO_EMPTY)

  • (?:\pL+|\S)\K\s* 59 steps (with PREG_SPLIT_NO_EMPTY)

  • ([\p{P}\p{S}])|\s 75 steps (with PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE)

  • (?>\PL|\pL*)\K\s*(?!$) 85 steps (no flags needed)

Upvotes: 0

RomanPerekhrest

Reputation: 92854

Use the preg_split function with the PREG_SPLIT_DELIM_CAPTURE flag. From the PHP manual:

PREG_SPLIT_DELIM_CAPTURE

If this flag is set, parenthesized expression in the delimiter pattern will be captured and returned as well.

$input = 'function a (int i) {};';
$content = preg_split('/([\p{P}\p{S}])|\s/', $input,
           -1, PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE);

print_r($content);

The output:

Array
(
    [0] => function
    [1] => a
    [2] => (
    [3] => int
    [4] => i
    [5] => )
    [6] => {
    [7] => }
    [8] => ;
)
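
One thing to be aware of with this pattern: the underscore belongs to the Unicode punctuation category, so \p{P} splits identifiers such as my_var into three pieces. The sketch below compares the pattern from this answer with an assumed variant that captures any character that is neither a word character nor whitespace. Both function names are hypothetical and exist only for the comparison.

```php
<?php
// Hypothetical helper names; neither function appears in the answer above.

// The answer's pattern: every punctuation or symbol character becomes a token.
function split_on_punct(string $input): array
{
    return preg_split(
        '/([\p{P}\p{S}])|\s/',
        $input,
        -1,
        PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE
    );
}

// Assumed variant: capture any character that is neither a word character
// nor whitespace, so "_" stays inside identifiers.
function split_keep_underscore(string $input): array
{
    return preg_split(
        '/([^\w\s])|\s+/',
        $input,
        -1,
        PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE
    );
}

print_r(split_on_punct('my_var = 1;'));        // my, _, var, =, 1, ;
print_r(split_keep_underscore('my_var = 1;')); // my_var, =, 1, ;
```

Which variant fits depends on whether the input treats underscores and digits as part of a word; for the sample in the question both give the same nine elements.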

Upvotes: 3

ssc-hrep3

Reputation: 16069

Instead of splitting with preg_split(), you can use the following pattern in combination with preg_match_all():

[a-zA-Z]+|[^a-zA-Z\s]

It matches either a run of one or more letters ([a-zA-Z]+) or a single character that is neither in [a-zA-Z] nor whitespace.

Here is an example:

<?php
$string = "function a (int i) {};";
$regex = "/[a-zA-Z]+|[^a-zA-Z\s]/";

// preg_match_all() fills $matches; index 0 holds the full-pattern matches.
preg_match_all($regex, $string, $matches);

print_r($matches[0]);
?>

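
If the input can also contain digits, underscores, or non-ASCII letters, the same idea can be widened. The following is only a sketch under that assumption: the tokenize() name and the \p{L}\p{N}_ character classes are not part of the original pattern, and the u modifier assumes the input is valid UTF-8.

```php
<?php
// Hypothetical helper; the name and the widened character classes are assumptions.
function tokenize(string $input): array
{
    // Either a run of letters, digits, or underscores,
    // or one other character that is not whitespace.
    preg_match_all('/[\p{L}\p{N}_]+|[^\p{L}\p{N}_\s]/u', $input, $matches);

    // Index 0 holds the full-pattern matches in order of appearance.
    return $matches[0];
}

print_r(tokenize('function a (int i) {};')); // function, a, (, int, i, ), {, }, ;
print_r(tokenize('x1 = y_2, 42;'));          // x1, =, y_2, ",", 42, ;
```

Commas and the equal sign come out as their own elements because they fall into the second alternative, one character at a time.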

Upvotes: 5
