ConfusedByRegex

Reputation: 41

How to split text into "steps" using regex in perl?

I am trying to split text into "steps". Let's say my text is

my $steps = "1.Do this. 2.Then do that. 3.And then maybe that. 4.Complete!";

I'd like the output to be:

"1.Do this."
"2.Then do that."
"3.And then maybe that."
"4.Complete!"

I'm not really that good with regex so help would be great!

I've tried many combinations, like:

split /(\s\d.)/ 

But that separates the numbering from the text.

Upvotes: 1

Views: 89

Answers (3)

zdim

Reputation: 66924

All step descriptions start with a number followed by a period and then contain only non-digits, up to the next number. So capture all such patterns:

my @s = $steps =~ / [0-9]+\. [^0-9]+ /xg; 

say for @s;

This works only if the step descriptions are certain to contain no digits. The same limitation applies to any approach that relies on matching a number, even a number followed by a period, because of decimal numbers.

If there may be numbers in there, we'd need to know more about the structure of the text.
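As a runnable sketch of the first pattern above, using the string from the question: note that [^0-9]+ also consumes the space before the next number, so the matches carry trailing whitespace. The trimming line is an addition for illustration, not part of the pattern itself.

```perl
use strict;
use warnings;
use feature 'say';

my $steps = "1.Do this. 2.Then do that. 3.And then maybe that. 4.Complete!";

# A number, a period, then everything up to (not including) the next digit.
my @s = $steps =~ / [0-9]+\. [^0-9]+ /xg;

# [^0-9]+ swallows the space before the next number; trim it (an added step).
s/\s+\z// for @s;

say qq("$_") for @s;
```

With the trim in place, each element matches the quoted lines the question asks for.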

Another delimiting pattern to consider is the punctuation that ends a sentence (. and ! in these examples), provided that no such characters appear inside a step's description and that no step has multiple sentences:

my @s = $steps =~ / [0-9]+\. .*? [.!] /xg;

Augment the list of patterns that end an item's description as needed, say with a ?, and/or ." sequence as punctuation often goes inside quotes.

If an item can have multiple sentences, or can use end-of-sentence punctuation mid-sentence (as part of a quotation, perhaps), then tighten the condition for an item's end by combining the two criteria: end-of-sentence punctuation that is followed by number+period (or by the end of the string)

my @s = $steps =~ /[0-9]+\. .*? (?: \."|\!"|[.\!]) (?=\s+[0-9]+\. | \z)/xg;

If this isn't good enough either then we'd really need a more precise description of that text.
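To see what the combined condition buys, here is a small sketch that runs the pattern above on a made-up string (the sample text is an assumption, chosen so that one step has two sentences and quoted punctuation). An item ends only at punctuation that is followed by whitespace and the next "number." or by the end of the string.

```perl
use strict;
use warnings;
use feature 'say';

# Hypothetical input: step 1 has two sentences and a quoted "!".
my $text = q{1.Say "stop!" first. Then wait. 2.Do "this." 3.Complete!};

my @items = $text =~ /[0-9]+\. .*? (?: \."|\!"|[.\!]) (?=\s+[0-9]+\. | \z)/xg;

say for @items;
# 1.Say "stop!" first. Then wait.
# 2.Do "this."
# 3.Complete!
```

Since . does not match a newline by default, multi-line text would probably need the /s modifier as well.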


An approach that uses only a "numbers-period" pattern to delimit an item's description, like

/ [0-9]+\. .*? (?=\s+[0-9]+\. | \z) /xg;

(or in a lookahead in split) fails with text like

1. Only $2.50   or   1. Version 2.4.1   ...
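A quick sketch of that failure, using a made-up sentence with a version number (the input is an assumption for illustration). The " 2." inside "Version 2.4.1" looks just like the start of a new step, so the text is cut into three pieces instead of two.

```perl
use strict;
use warnings;
use feature 'say';

# Hypothetical input with only two real steps.
my $text = '1. Install Version 2.4.1 first. 2. Run it.';

my @items = $text =~ / [0-9]+\. .*? (?=\s+[0-9]+\. | \z) /xg;

say qq("$_") for @items;
# "1. Install Version"
# "2.4.1 first."
# "2. Run it."
```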


To include text like 1. Do "this." and 2. Or "that!" we'd want

/ [0-9]+\. .*? (?: \." | !" | [.!?]) /xg;

Upvotes: 3

Polar Bear

Reputation: 6808

The following sample code shows how a regex can fill the %steps hash in one line of code.

Once the data is obtained, you can slice and dice it any way you like.

Check that the sample fits your problem.

use strict;
use warnings;
use feature 'say';

use Data::Dumper;

my($str,%steps,$re);

$str   = '1.Do this. 2.Then do that. 3.And then maybe that. 4.Complete!';
$re    = qr/(\d+)\.(\D+)\./;
%steps = $str =~ /$re/g;

say Dumper(\%steps);

say "$_. $steps{$_}" for sort keys %steps;

Output

$VAR1 = {
          '1' => 'Do this',
          '2' => 'Then do that',
          '3' => 'And then maybe that'
        };

1. Do this
2. Then do that
3. And then maybe that
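As the output shows, step 4 is missing: the pattern requires a closing period and "4.Complete!" ends with an exclamation mark. A possible variant, offered as a sketch rather than as the answer's own code, accepts either terminator, keeps it in the captured text, and sorts the keys numerically so that step 10 would not sort before step 2. It still assumes no digits and no mid-step . or ! in a description.

```perl
use strict;
use warnings;
use feature 'say';

my $str = '1.Do this. 2.Then do that. 3.And then maybe that. 4.Complete!';

# Key: the step number. Value: shortest run of non-digits ending in . or !
my %steps = $str =~ /(\d+)\.(\D+?[.!])/g;

# Numeric sort, so keys like 10 come after 9 rather than after 1.
say "$_.$steps{$_}" for sort { $a <=> $b } keys %steps;
```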

Upvotes: 0

ikegami

Reputation: 386426

I would indeed use split. But you need to exclude the digit from the match by using a lookahead.

my @steps = split /\s+(?=\d+\.)/, $steps;
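A short runnable comparison, using the string from the question, may make the difference clearer. A lookahead matches without consuming, so the number stays attached to the following text; a capturing group in split, as in the question's attempt, returns the delimiters as extra list elements.

```perl
use strict;
use warnings;
use feature 'say';

my $steps = "1.Do this. 2.Then do that. 3.And then maybe that. 4.Complete!";

# Split on whitespace that is followed by "digits." without consuming the digits.
my @steps = split /\s+(?=\d+\.)/, $steps;
say qq("$_") for @steps;

# The attempt from the question: the captured delimiter (" 2." etc.)
# comes back as its own element, giving 7 pieces instead of 4.
my @attempt = split /(\s\d.)/, $steps;
say scalar(@attempt), " pieces with the capturing version";
```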

Upvotes: 4
