Regular Expression to Clean a numbered list

Question

I've only just started playing with regex and seem to be a little stuck! I have written a bulk find and replace using multiline mode in TextSoap. It is for cleaning up recipes that I have OCR'd. Because each recipe has both Ingredients and Directions, I cannot simply change every "1 " to "1. ", as this could rewrite "1 Tbsp" as "1. Tbsp".

I therefore check whether the following two lines (possibly with extra rows in between) begin with the next sequential numbers, using this code as the find:

^(1) (.*)\n?((\n))(^2 (.*)\n?(\n)^3 (.*)\n?(\n))
^(2) (.*)\n?((\n))(^3 (.*)\n?(\n)^4 (.*)\n?(\n))
^(3) (.*)\n?((\n))(^4 (.*)\n?(\n)^5 (.*)\n?(\n))
^(4) (.*)\n?((\n))(^5 (.*)\n?(\n)^6 (.*)\n?(\n))
^(5) (.*)\n?((\n))(^6 (.*)\n?(\n)^7 (.*)\n?(\n))

and the following as the replace for each of the above:

$1. $2 $3 $4$5

My problem is that, although it works as I wanted, it never performs the replacement for the last three numbers in the list.

An example of the text I want to clean up:

1 This is the first step in the list

2 Second lot if instructions to run through
3 Doing more of the recipe instruction

4 Half way through cooking up a storm

5 almost finished the recipe

6 Serve and eat

And what I want it to look like:

1. This is the first step in the list

2. Second lot if instructions to run through

3. Doing more of the recipe instruction

4. Half way through cooking up a storm

5. almost finished the recipe

6. Serve and eat

Is there a way to check the line or two above the current one, so that the same test runs backwards? I have looked at lookahead and lookbehind, and that is where I get confused. Does anybody have a method to clean up my numbered list, or can anyone help me with the regex I need, please?

alan · Accepted Answer

dan1111 is right: you may run into trouble with similar-looking data. But given the sample you provided, this should work:

^(\d+)\s+([^\r\n]+)(?:[\r\n]*) // search

$1. $2\r\n\r\n                 // replace

If you're not using Windows, remove each \r from the replace string.

Explanation:

^           // beginning of the line
(\d+)       // capture group 1. one or more digits
\s+         // any spaces after the digit. don't capture
([^\r\n]+)  // capture group 2. all characters up to any EOL
(?:[\r\n]*) // consume additional EOL, but do not capture

Replace:

$1.       // group 1 (the digit), then period and a space
$2        // group 2
\r\n\r\n  // two EOLs, to create a blank line
          // (remove both \r for Linux)
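
For anyone who wants to try this outside TextSoap, here is a minimal sketch of the same search and replace in Python. It assumes Python's re module with re.MULTILINE so that ^ matches at the start of every line. The function name add_periods and its eol parameter are illustrative only and are not part of the answer above.

```python
import re

# Same search pattern as above: leading digits, whitespace, the rest of the
# line, then any run of trailing line breaks (consumed but not captured).
STEP = re.compile(r"^(\d+)\s+([^\r\n]+)(?:[\r\n]*)", re.MULTILINE)


def add_periods(text: str, eol: str = "\n") -> str:
    """Rewrite '1 step' as '1. step' and put one blank line after each step.

    eol is an assumed convenience parameter: pass "\r\n" for Windows line endings.
    """
    # A function is used as the replacement so eol is inserted literally.
    return STEP.sub(lambda m: f"{m.group(1)}. {m.group(2)}{eol}{eol}", text)


sample = (
    "1 This is the first step in the list\n"
    "\n"
    "2 Second lot if instructions to run through\n"
    "3 Doing more of the recipe instruction\n"
    "\n"
    "4 Half way through cooking up a storm\n"
)

cleaned = add_periods(sample)
print(cleaned)
```

Note that the pattern matches any line that starts with digits followed by whitespace, so an ingredient line such as "1 Tbsp sugar" would also become "1. Tbsp sugar". Running it only on the Directions part of each recipe avoids that.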
