Reputation: 363
I've only just started playing with regex and seem to be a little stuck! I have written a bulk find and replace using multiline mode in TextSoap. It is for cleaning up recipes that I have OCR'd. Because there are both Ingredients and Directions sections, I cannot simply change "1 " to "1. ", as this could rewrite "1 Tbsp" as "1. Tbsp".
I therefore check whether the following two lines (possibly with extra blank rows in between) start with the next sequential numbers, using this code as the find:
^(1) (.*)\n?((\n))(^2 (.*)\n?(\n)^3 (.*)\n?(\n))
^(2) (.*)\n?((\n))(^3 (.*)\n?(\n)^4 (.*)\n?(\n))
^(3) (.*)\n?((\n))(^4 (.*)\n?(\n)^5 (.*)\n?(\n))
^(4) (.*)\n?((\n))(^5 (.*)\n?(\n)^6 (.*)\n?(\n))
^(5) (.*)\n?((\n))(^6 (.*)\n?(\n)^7 (.*)\n?(\n))
and the following as the replace for each of the above:
$1. $2 $3 $4$5
My problem is that although it works as I wanted it to, it never performs the task for the last three numbers...
An example of the text I want to clean up:
1 This is the first step in the list
2 Second lot if instructions to run through
3 Doing more of the recipe instruction
4 Half way through cooking up a storm
5 almost finished the recipe
6 Serve and eat
And what I want it to look like:
1. This is the first step in the list
2. Second lot if instructions to run through
3. Doing more of the recipe instruction
4. Half way through cooking up a storm
5. almost finished the recipe
6. Serve and eat
Is there a way to check the previous line or two above to run this backwards? I have looked at lookahead and lookbehind and I am somewhat confused at that point. Does anybody have a method to clean up my numbered list or help me with the regex I desire please?
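One way to check the neighbouring lines without lookbehind is to leave the single-regex approach and walk the text line by line. The following is only a minimal sketch in Python rather than TextSoap; the names `NUMBERED` and `add_periods` are made up for illustration. It adds the period when the previous or the next non-blank line carries the adjacent number, so the last steps of a list are handled too. Note that it would still rewrite a run of ingredient lines that happen to be numbered sequentially.

```python
import re

# A line that starts with a number, one space, then the rest of the line.
NUMBERED = re.compile(r"^(\d+) (.*)$")

def add_periods(text: str) -> str:
    lines = text.split("\n")
    # Indices of non-blank lines, so blank rows between steps are skipped.
    content = [i for i, line in enumerate(lines) if line.strip()]
    # Map line index -> leading number, for lines that look numbered.
    numbers = {}
    for i in content:
        m = NUMBERED.match(lines[i])
        if m:
            numbers[i] = int(m.group(1))
    for pos, i in enumerate(content):
        if i not in numbers:
            continue
        n = numbers[i]
        prev_i = content[pos - 1] if pos > 0 else None
        next_i = content[pos + 1] if pos + 1 < len(content) else None
        follows = prev_i is not None and numbers.get(prev_i) == n - 1
        precedes = next_i is not None and numbers.get(next_i) == n + 1
        # Only rewrite when a neighbour continues the sequence.
        if follows or precedes:
            lines[i] = NUMBERED.sub(r"\1. \2", lines[i])
    return "\n".join(lines)
```

Because each line looks both backwards and forwards, the final step qualifies through the line before it, which is the case the chained patterns above miss.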
Upvotes: 7
Views: 1948
Reputation: 4842
dan1111 is right: you may run into trouble with similar-looking data. But given the sample you provided, this should work:
^(\d+)\s+([^\r\n]+)(?:[\r\n]*) // search
$1. $2\r\n\r\n // replace
If you're not using Windows, remove each \r from the replace string.
Explanation:
^ // beginning of the line
(\d+) // capture group 1. one or more digits
\s+ // one or more whitespace characters after the number. not captured
([^\r\n]+) // capture group 2. all characters up to any EOL
(?:[\r\n]*) // consume additional EOL, but do not capture
Replace:
$1. // group 1 (the digit), then period and a space
$2 // group 2
\r\n\r\n // two EOLs, to create a blank line
// (remove both \r for Linux)
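For anyone who wants to try the pattern outside TextSoap, here is a rough Python equivalent. It is a sketch only: the names `STEP` and `number_steps` and the `eol` parameter are mine, and `re.MULTILINE` is assumed to play the role of TextSoap's multiline option so that `^` matches at the start of every line.

```python
import re

# Same search pattern as above; re.MULTILINE lets ^ match at each line start.
STEP = re.compile(r"^(\d+)\s+([^\r\n]+)(?:[\r\n]*)", re.MULTILINE)

def number_steps(text: str, eol: str = "\n") -> str:
    # Mirrors the replace string "$1. $2" followed by two EOLs (a blank line).
    # Pass eol="\r\n" for Windows line endings.
    return STEP.sub(lambda m: f"{m.group(1)}. {m.group(2)}{eol}{eol}", text)

sample = (
    "1 This is the first step in the list\n"
    "2 Second lot if instructions to run through\n"
    "3 Doing more of the recipe instruction\n"
)
print(number_steps(sample))
```

As dan1111 points out, the same pattern also turns "1 Tbsp salt" into "1. Tbsp salt", so it is only safe on text that contains the steps and nothing else.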
Upvotes: 2
Reputation:
What about this?
1 Tbsp salt
2 Tsp sugar
3 Eggs
You have run into a major limitation of regexes: they don't work well when your data can't be strictly defined. You may intuitively know what are ingredients and what are steps, but it isn't easy to go from that to a reliable set of rules for an algorithm.
I suggest you instead think about an approach that is based on position within the file. A given cookbook usually formats all of its recipes the same way: for example, the ingredients come first, followed by the list of steps. That position would probably be an easier way to tell the two apart.
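A position-based approach could look something like the sketch below. It assumes, and this is purely an assumption about the OCR output, that each recipe has a line reading exactly "Ingredients" and a later line reading exactly "Directions"; the function name `number_directions` and its parameters are made up for the example. Only numbered lines after the Directions heading get a period.

```python
import re

# A line that starts with a number, one space, then the rest of the line.
STEP = re.compile(r"^(\d+) (.*)$")

def number_directions(text: str, start: str = "Directions",
                      stop: str = "Ingredients") -> str:
    out = []
    in_directions = False
    for line in text.split("\n"):
        heading = line.strip().lower()
        if heading == start.lower():
            in_directions = True       # steps begin after this heading
        elif heading == stop.lower():
            in_directions = False      # a new recipe's ingredients begin
        elif in_directions:
            line = STEP.sub(r"\1. \2", line)
        out.append(line)
    return "\n".join(out)
```

With this, "1 Tbsp salt" under Ingredients is left alone while "1 Mix everything" under Directions becomes "1. Mix everything". If your cookbook uses different headings, pass them in as `start` and `stop`.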
Upvotes: 1