Palendrone
Palendrone

Reputation: 363

Regular Expression to Clean a numbered list

I've only just started playing with Regex and seem to be a little stuck! I have written a bulk find and replace using multiline in TextSoap. It is for cleaning up recipes that I have OCR'd and because there is Ingredients and Directions I cannot change a "1 " to become "1. " as this could rewrite "1 Tbsp" as "1. Tbsp".

I therefore did a check to see if the following two lines (possibly with extra rows) was the next sequential numbers using this code as the find:

^(1) (.*)\n?((\n))(^2 (.*)\n?(\n)^3 (.*)\n?(\n))
^(2) (.*)\n?((\n))(^3 (.*)\n?(\n)^4 (.*)\n?(\n))
^(3) (.*)\n?((\n))(^4 (.*)\n?(\n)^5 (.*)\n?(\n))
^(4) (.*)\n?((\n))(^5 (.*)\n?(\n)^6 (.*)\n?(\n))
^(5) (.*)\n?((\n))(^6 (.*)\n?(\n)^7 (.*)\n?(\n))

and the following as the replace for each of the above:

$1. $2 $3 $4$5

My Problem is that although it works as I wanted it to, it will never perform the task for the last three numbers...

An example of the text I want to clean up:

1 This is the first step in the list

2 Second lot if instructions to run through
3 Doing more of the recipe instruction

4 Half way through cooking up a storm

5 almost finished the recipe

6 Serve and eat

And what I want it to look like:

1. This is the first step in the list

2. Second lot if instructions to run through

3. Doing more of the recipe instruction

4. Half way through cooking up a storm

5. almost finished the recipe

6. Serve and eat

Is there a way to check the previous line or two above to run this backwards? I have looked at lookahead and lookbehind and I am somewhat confused at that point. Does anybody have a method to clean up my numbered list or help me with the regex I desire please?

Upvotes: 7

Views: 1948

Answers (2)

alan
alan

Reputation: 4842

dan1111 is right. You may run into trouble with similar looking data. But given the sample you provided, this should work:

^(\d+)\s+([^\r\n]+)(?:[\r\n]*) // search

$1. $2\r\n\r\n                 // replace

If you're not using Windows, remove the \rs from the replace string.

Explanation:

^           // beginning of the line
(\d+)       // capture group 1. one or more digits
\s+         // any spaces after the digit. don't capture
([^\r\n]+)  // capture group 2. all characters up to any EOL
(?:[\r\n]*) // consume additional EOL, but do not capture

Replace:

$1.       // group 1 (the digit), then period and a space
$2        // group 2
\r\n\r\n  // two EOLs, to create a blank line
          // (remove both \r for Linux)

Upvotes: 2

user1919238
user1919238

Reputation:

What about this?

1 Tbsp salt
2 Tsp sugar
3 Eggs

You have run into a major limitation of regexes: they don't work well when your data can't be strictly defined. You may intuitively know what are ingredients and what are steps, but it isn't easy to go from that to a reliable set of rules for an algorithm.

I suggest you instead think about an approach that is based on position within the file. A given cookbook usually formats all the recipes the same: such as, the ingredients come first, followed by the list of steps. This would probably be an easier way to tell the difference.

Upvotes: 1

Related Questions