user1604298

Reputation: 25

preg_split() producing a single row array instead of splitting based on regex

Someone may spot this immediately, but I've been staring at this search pattern for so long that I can't see what I'm missing.

// test string
$stringToSplit = "I awoke in the dim light of the fire pit surrounded by daunting stone walls, my chest tight and my breath stolen by the creak of the heavy oak door opposite my bed. But it wasn’t my bed; that sack of feathers and the sheets of linen were unfamiliar to me. It was the place that my captor had left me. I found it strange, despite struggling against my bonds and having the memory of the cord tearing into my flesh, that no rip, no break of the skin remained. My hands were free, though; bits of rope—severed by a knife or a sword—lay on the floor beside me.";

//test split parameters
$split = '/["’“]?(A-Z)(((Mr|Ms|Mrs|Dr|Gen|Col|Maj|Capt|Lt|Sgt|Cpl|Pvt|Hon|Jr|Sr|St|Rev|Prof)\.\s+((?!\w{2,}[.?!][’\"]?\s+["’]?[A-Z]).))?)((?![.?!]["’]?\s+["’]?[A-Z]).)[.?!…—]+["’”]?/';

//split based on parameters
$splitText = preg_split($split, $stringToSplit);

//return split text
print_r($splitText);

Current output:

Array ( [0] => I awoke in the dim light of the fire pit surrounded by daunting stone walls, my chest tight and my breath stolen by the creak of the heavy oak door opposite my bed. But it wasn’t my bed; that sack of feathers and the sheets of linen were unfamiliar to me. It was the place that my captor had left me. I found it strange, despite struggling against my bonds and having the memory of the cord tearing into my flesh, that no rip, no break of the skin remained. My hands were free, though; bits of rope—severed by a knife or a sword—lay on the floor beside me. )

Desired output:

Array ( [0] => I awoke in the dim light of the fire pit surrounded by daunting stone walls, my chest tight and my breath stolen by the creak of the heavy oak door opposite my bed.
[1] => But it wasn’t my bed; that sack of feathers and the sheets of linen were unfamiliar to me. 
[2] => It was the place that my captor had left me. 
[3] => I found it strange, despite struggling against my bonds and having the memory of the cord tearing into my flesh, that no rip, no break of the skin remained.
[4] => My hands were free, though; bits of rope—severed by a knife or a sword—lay on the floor beside me. )

The regex is complicated because it is meant to split any string in the text properly without getting hung up on abbreviations and on punctuation that is not the true end of a segment. Not all of the rules apply to the sample text, but I need them to parse any given sample.

As it stands, the code returns a single key/value pair of key 0 with the value being the entire un-split string.
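For reference, preg_split() returns the whole subject as a single element whenever the pattern never matches. One thing worth checking in the pattern above is `(A-Z)`: written with parentheses it is a group that matches the literal characters "A-Z", not an uppercase letter, and that group is not optional. The sketch below uses a hypothetical short sample (not the full text) to show the difference between `(A-Z)` and the character class `[A-Z]`:

```php
<?php
// Hypothetical short sample, not the full text from the question.
$sample = "It was dark. But I was free.";

// (A-Z) is a capture group matching the literal three characters "A-Z".
$literal = preg_split('/(A-Z)/', $sample);

// [A-Z] is a character class matching one uppercase ASCII letter.
$class = preg_split('/[A-Z]/', $sample);

print_r($literal); // one element: the pattern never matched
print_r($class);   // four elements: split at each capital letter
```

This only illustrates why a non-matching pattern yields a one-element array; it is not a corrected version of the full pattern.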

Edit to add: I am adding a larger sample of the text which shows the reasons for some of the rules from the regex string, for clarity.

$stringToSplit = "I awoke in the dim light of the fire pit surrounded by daunting stone walls, my chest tight and my breath stolen by the creak of the heavy oak door opposite my bed. But it wasn’t my bed; that sack of feathers and the sheets of linen were unfamiliar to me. It was the place that my captor had left me. I found it strange, despite struggling against my bonds and having the memory of the cord tearing into my flesh, that no rip, no break of the skin remained. My hands were free, though; bits of rope—severed by a knife or a sword—lay on the floor beside me. They must be sure that I won’t… can’t escape. “Good,” my captor said, stepping the rest of the way into the room. “You’ve awakened.” The way he said it sent tingles racing along my skin. Whereas I considered waking up a trivial matter, this man seemed to reflect upon the act with some reverence. The man’s cloak, his cowl draped over his hair and forehead, matched the drab gray of my prison’s walls, and a shadow cast over his face made it impossible to distinguish any of his features. His eyes, though, were obvious, and they must have caught the firelight because they glowed pale blue. “My family…” I started, inching away as if I could escape through the stone at my back. “They’ll pay whatever ransom you ask. Please, I beg—” “You waste your breath.” The man approached, but he stopped at the table halfway and lay upon it folded cloth. “I am not the one who keeps you here.” “But you serve him… her? You must reason with your master—” “I must do nothing,” he replied, laughing. “And your family might not want you in your condition. Have you smelled yourself lately?” “No,” I said flatly, and it wasn’t until the man had said something that I noticed I couldn’t smell the wood roasting in the fireplace, or anything else for that matter. My whole body was numb except for my head, which still ached. I recalled that he had bashed me in the head with a club, but I couldn’t piece together much else. 
“Why are you keeping me here?” “You’ll see.” He gestured at the table. “I suggest you change.” And he closed the door behind him. I stood there for a time, consumed with loathing and hatred for the man. I glanced at the fire and then at the table. When I studied the door from where I stood, I realized that it had no lock, and the place seemed unlike any cell I’d ever seen. No prisoner, for all I knew, had ever been treated to his own fireplace, stuffed mattress, or wash basin. And so, believing my chances of escape slim and without any available options, I stripped the tattered clothes from my body. The shirt—the one my father had bought for me, the fine silk one—couldn’t be salvaged. The pants, too, were in ribbons and came off easily.";

Upvotes: 0

Views: 105

Answers (2)

Maciej Król

Reputation: 382

I've simplified your regex a little. I've also used a negative lookbehind, which may not be supported if you are running this in a browser.

But you can try this.

(?<!Mr|Ms|Mrs|Dr|Gen|Col|Maj|Capt|Lt|Sgt|Cpl|Pvt|Hon|Jr|Sr|St|Rev|Prof)(?<!["”“'])[.!?]+(?!["“”'])

Tested on Google Chrome v76.0.3809.132 using your bigger text sample, and everything seems to work correctly.

Features:

  • Match sentence-ending punctuation (., !, ?)
  • Don't match dots after Mr, Ms, etc.
  • Don't match a dot that is directly next to a quote character (", ”, “, ')
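The features above can be sketched in PHP as follows. This is a minimal illustration, not the exact pattern from this answer: the title list is shortened to three entries, the sample sentence is hypothetical, and the `u` modifier is an assumption added so that the curly quotes are treated as whole UTF-8 characters.

```php
<?php
// Hypothetical sample containing a title, a quoted sentence and a final "!".
$sample = 'Dr. Smith woke up. “Stay.” He left!';

// Shortened title list (assumption); "u" makes PCRE treat the input as UTF-8.
$pattern = '/(?<!Mr|Mrs|Dr)(?<!["”“\'])[.!?]+(?!["“”\'])/u';

// The matched punctuation is consumed, so it does not appear in the pieces.
$parts = array_map('trim', preg_split($pattern, $sample, -1, PREG_SPLIT_NO_EMPTY));

print_r($parts);
// Array ( [0] => Dr. Smith woke up [1] => “Stay.” He left )
```

Note that the period after "Dr" and the period before the closing quote are both skipped, so the quoted sentence stays attached to the text that follows it.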

Edit.

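To keep the delimiters, wrap the whole match in a positive lookbehind. The pattern then matches the zero-width position right after the punctuation, so preg_split() splits there without consuming the punctuation itself.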

$regex = "/(?<=(?<!Mr|Ms|Mrs|Dr|Gen|Col|Maj|Capt|Lt|Sgt|Cpl|Pvt|Hon|Jr|Sr|St|Rev|Prof)(?<![\"”“'])[!?.](?![!?.])(?![\"“”']))/";

$subject = "your text here";

$result = preg_split($regex, $subject, 0, PREG_SPLIT_NO_EMPTY);
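As a small check of the delimiter-keeping idea, the sketch below runs the same shape of pattern on a hypothetical sample. The title list is shortened, and the `u` modifier and the final trim are assumptions added for the illustration.

```php
<?php
// Hypothetical sample; the title list below is shortened to three entries.
$sample = 'Dr. Smith woke up. Was it late? Yes!';

// The outer (?<=...) matches the empty position just after the punctuation,
// so the punctuation stays at the end of each piece.
$regex = '/(?<=(?<!Mr|Mrs|Dr)(?<!["”“\'])[!?.](?![!?.])(?!["“”\']))/u';

$pieces = preg_split($regex, $sample, -1, PREG_SPLIT_NO_EMPTY);

// Each piece after the first starts with the space that followed the delimiter.
$sentences = array_map('trim', $pieces);

print_r($sentences);
// Array ( [0] => Dr. Smith woke up. [1] => Was it late? [2] => Yes! )
```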

Upvotes: 1

Lucas Arbex

Reputation: 909

If you want to split your string on every period (like the example you showed), but not when they are preceded by Mr|Ms|Mrs..., you can just do something like this:

$stringToSplit = "I awoke in the dim light of the fire pit surrounded by daunting stone walls, my chest tight and my breath stolen by the creak of the heavy oak door opposite my bed. But it wasn’t my bed; that sack of feathers and the sheets of linen were unfamiliar to me. It was the place that my captor had left me. I found it strange, despite struggling against my bonds and having the memory of the cord tearing into my flesh, that no rip, no break of the skin remained. My hands were free, though; bits of rope—severed by a knife or a sword—lay on the floor beside me. They must be sure that I won’t… can’t escape. “Good,” my captor said, stepping the rest of the way into the room. “You’ve awakened.” The way he said it sent tingles racing along my skin. Whereas I considered waking up a trivial matter, this man seemed to reflect upon the act with some reverence. The man’s cloak, his cowl draped over his hair and forehead, matched the drab gray of my prison’s walls, and a shadow cast over his face made it impossible to distinguish any of his features. His eyes, though, were obvious, and they must have caught the firelight because they glowed pale blue. “My family…” I started, inching away as if I could escape through the stone at my back. “They’ll pay whatever ransom you ask. Please, I beg—” “You waste your breath.” The man approached, but he stopped at the table halfway and lay upon it folded cloth. “I am not the one who keeps you here.” “But you serve him… her? You must reason with your master—” “I must do nothing,” he replied, laughing. “And your family might not want you in your condition. Have you smelled yourself lately?” “No,” I said flatly, and it wasn’t until the man had said something that I noticed I couldn’t smell the wood roasting in the fireplace, or anything else for that matter. My whole body was numb except for my head, which still ached. I recalled that he had bashed me in the head with a club, but I couldn’t piece together much else. 
“Why are you keeping me here?” “You’ll see.” He gestured at the table. “I suggest you change.” And he closed the door behind him. I stood there for a time, consumed with loathing and hatred for the man. I glanced at the fire and then at the table. When I studied the door from where I stood, I realized that it had no lock, and the place seemed unlike any cell I’d ever seen. No prisoner, for all I knew, had ever been treated to his own fireplace, stuffed mattress, or wash basin. And so, believing my chances of escape slim and without any available options, I stripped the tattered clothes from my body. The shirt—the one my father had bought for me, the fine silk one—couldn’t be salvaged. The pants, too, were in ribbons and came off easily.";

$split = preg_split('/(?:(?<!Mr|Ms|Mrs|Dr|Gen|Col|Maj|Capt|Lt|Sgt|Cpl|Pvt|Hon|Jr|Sr|St|Rev|Prof)\.|[!?)"])/', iconv('UTF-8', 'ASCII//TRANSLIT', $stringToSplit));

var_dump(array_filter(array_map('trim', $split))); // array_map trims whitespace from each piece; array_filter then removes the empty elements

EDIT: To split on periods, but not when they are preceded by Mr|Ms|Mrs..., just use a regex negative lookbehind.
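To see the negative lookbehind in isolation, here is a minimal sketch with a hypothetical sample and only "Mr" in the list. It uses preg_match() with PREG_OFFSET_CAPTURE to report where the first accepted period sits:

```php
<?php
// Hypothetical sample; only "Mr" is listed to keep the sketch short.
$sample = 'Mr. X left.';

// (?<!Mr)\. matches a period only when the two characters before it are not "Mr".
preg_match('/(?<!Mr)\./', $sample, $match, PREG_OFFSET_CAPTURE);

// $match[0][1] is the byte offset of the first period that passed the lookbehind.
$offset = $match[0][1];

echo $offset, PHP_EOL; // 10
```

The period at offset 2 (after "Mr") is rejected, so the first match is the final period at offset 10.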

Let me know if it can be useful to you now.

Upvotes: 0
