How to separate a string into multiple parts

Question

I have some Persian text (right-to-left direction) that I want to split into parts.

Example:

$str =" کامپیوتر : وسیله ی الکتریکی است 1.ماوس 2.کیبورد
       و مانیتور 3. کیس
چاپگر: وسیله ای است برای پرینت بر روی معمولا کاغذ
موبایل : نوعی تلفن است به صورت سیار و بی سیم
که جدیدا خیلی هم رایج شده است
و اکثر انسان ها دارند
خانه : محلی برای زندگی است. 1. حیوانات 2. انواع انسان ها
برای خود خانه می سازند. ";

I want this output:

{
    arr[
       {
         word: "کامپیوتر",
         mean: "وسیله ی الکتریکی است 1.ماوس 2.کیبورد و مانیتور 3. کیس"  
       },

       {
        word: "چاپگر",
        mean: "وسیله ای است برای پرینت بر روی معمولا کاغذ"
       },

       {
        word: "موبایل",
        mean: "نوعی تلفن است به صورت سیار و بی سیم که جدیدا خیلی هم رایج شده است و اکثر انسان ها دارند"
       },

       {
        word: "خانه",
        mean: "محلی برای زندگی است. 1. حیوانات 2. انواع انسان ها برای خود خانه می سازند."
       }
      ]
}

I don't think I can just use explode(":", $str), because the meaning of a word is not constant in length: it sometimes spans several lines. I think I need a regex. How can I do that?

Edit: An English example:

$str = "apple : it is a fruit
       computer : 1.an electronic device for storing and 
        processing data typically in binary form 2. according to
        instructionsgiven to it in a variable program
        wall: a continuous vertical brick or stone structure
        that encloses or divides an area of land. 1. on the
       wall 2. brick wall 3. climbing wall";

I want this output:

{
    arr[
       {
         word: "apple",
         mean: "it is a fruit"  
       },

       {
        word: "computer",
        mean: "1.an electronic device for storing and processing data typically in binary form 2. according to instructionsgiven to it in a variable program"
       },

       {
        word: "wall",
        mean: "a continuous vertical brick or stone structure that encloses or divides an area of land. 1. on the wall 2. brick wall 3. climbing wall"
       }
      ]
}

Wiktor Stribiżew · Accepted Answer

You can use the following regex:

'~\h*(?<term>[^:\n]*?)\s*:\s*(?<mean>(?:(?!\n\h*[^\n:]*:).)*)~us'

See regex demo

I am using named capture groups so that you can access them more easily later on. Note that you need the /u modifier to work with Unicode strings in PHP regex!

The regex matches:

\h* - 0 or more horizontal whitespace characters
(?<term>[^:\n]*?) - Group 1 named "term" that matches 0 or more characters other than : and a newline, as few as possible
\s*:\s* - 0 or more whitespaces followed by : and 0 or more whitespaces
(?<mean>(?:(?!\n\h*[^\n:]*:).)*) - Group 2 named "mean" that matches any characters (including newlines, since I am using the /s modifier) up to a newline that starts a sequence like spaces+term+:. This (?:(?!...).)* construct is called a tempered greedy token. You can unroll it as (?<mean>[^\n]*(?:\n(?!\h*[^\n:]*:)[^\n]*)*) for better performance (192 steps vs. 1226).
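To make the difference between the two forms of the "mean" group concrete, here is a minimal sketch that runs both on a shortened English sample and checks that they return the same matches. The sample text and the variable names ($tempered, $unrolled, $sameMatches, $terms) are made up for illustration, and the lines are joined with explicit "\n" so the result does not depend on the file's line endings.

```php
<?php
// Shortened sample, loosely based on the English example in the question.
$str = "apple : it is a fruit\n"
     . "       computer : 1.an electronic device\n"
     . "        processing data 2. according to\n"
     . "       wall: a continuous structure\n"
     . "       that encloses land";

// Tempered greedy token: before each character, check that we are not at
// a newline that begins a new "term:" line.
$tempered = '~\h*(?<term>[^:\n]*?)\s*:\s*(?<mean>(?:(?!\n\h*[^\n:]*:).)*)~us';

// Unrolled form: take the rest of the line, then keep taking whole lines
// as long as the next line does not look like "term:".
$unrolled = '~\h*(?<term>[^:\n]*?)\s*:\s*(?<mean>[^\n]*(?:\n(?!\h*[^\n:]*:)[^\n]*)*)~u';

preg_match_all($tempered, $str, $a, PREG_SET_ORDER);
preg_match_all($unrolled, $str, $b, PREG_SET_ORDER);

$sameMatches = ($a === $b);          // both patterns capture identical data
$terms = array_column($b, 'term');   // the list of captured terms

var_dump($sameMatches);
print_r($terms);
```

The unrolled form only runs the lookahead once per line break instead of once per character, which is where the lower step count comes from.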

Use the regex with preg_match_all rather than preg_replace, since you need an array:

$str =" کامپیوتر : وسیله ی الکتریکی است 1.ماوس 2.کیبورد
       و مانیتور 3. کیس
چاپگر: وسیله ای است برای پرینت بر روی معمولا کاغذ
موبایل : نوعی تلفن است به صورت سیار و بی سیم
که جدیدا خیلی هم رایج شده است
و اکثر انسان ها دارند
خانه : محلی برای زندگی است. 1. حیوانات 2. انواع انسان ها
برای خود خانه می سازند. ";
preg_match_all('~\h*(?<term>[^:\n]*?)\s*:\s*(?<mean>(?:(?!\n\h*[^\n:]*:).)*)~us', $str, $m, PREG_SET_ORDER);
print_r($m);

See the code demo.
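The raw matches still contain the line breaks and indentation of the input, while the output requested in the question has each meaning on a single line. One possible way to get from the matches to that shape is sketched below. The helper name parse_glossary, the whitespace-collapsing step, and the JSON output are assumptions added for illustration, not part of the original answer.

```php
<?php
// Hypothetical helper: converts the text into a list of word/mean pairs
// using the named groups "term" and "mean".
function parse_glossary(string $text): array
{
    $re = '~\h*(?<term>[^:\n]*?)\s*:\s*(?<mean>(?:(?!\n\h*[^\n:]*:).)*)~us';
    preg_match_all($re, $text, $matches, PREG_SET_ORDER);

    $entries = [];
    foreach ($matches as $m) {
        $entries[] = [
            'word' => trim($m['term']),
            // Collapse every run of whitespace (newlines included) to one space.
            'mean' => trim(preg_replace('~\s+~u', ' ', $m['mean'])),
        ];
    }
    return $entries;
}

// Shortened variant of the English example, joined with explicit "\n".
$str = "apple : it is a fruit\n"
     . "       computer : 1.an electronic device for storing and \n"
     . "        processing data\n"
     . "       wall: a continuous vertical brick or stone structure\n"
     . "        that encloses or divides an area of land. 1. on the\n"
     . "       wall 2. brick wall 3. climbing wall";

$result = ['arr' => parse_glossary($str)];

// JSON_UNESCAPED_UNICODE keeps Persian text readable instead of \uXXXX escapes.
echo json_encode($result, JSON_PRETTY_PRINT | JSON_UNESCAPED_UNICODE), "\n";
```

Note that a continuation line containing a colon would be treated as the start of a new entry, so this only works when meanings never contain ":" after a line break.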
