Joseph Polizzotto

Reputation: 63

Finding Matching Strings Within Paragraphs

I have a TXT file with LaTeX math equations where a single $ delimiter is used before and after each inline equation.

I would like to find each equation within a paragraph and replace its delimiters with XML opening and closing tags.

E.g.,

The following paragraph:

This is the beginning of a paragraph $first equation$ ...and here is some text... $second equation$ ...and here is more text... $third equation$ ...and here is yet more text... $fourth equation$

should become:

This is the beginning of a paragraph <equation>first equation</equation> ...and here is some text... <equation>second equation</equation> ...and here is more text... <equation>third equation</equation> ...and here is yet more text... <equation>fourth equation</equation>

I have tried sed and perl commands such as the following:

perl -p -e 's/(\$)(.*[^\$])(\$)/<equation>$2<\/equation>/'

But these commands convert only the first opening delimiter and the last closing delimiter, leaving every delimiter in between untouched:

This is the beginning of a paragraph <equation>first equation$ ...and here is some text... $second equation$ ...and here is more text... $third equation$ ...and here is yet more text... $fourth equation</equation>

I would also like a robust solution that accounts for a single $ that is not used as a LaTeX delimiter. E.g.,

This is the beginning of a paragraph $first equation$ ...and here is some text that includes a single dollar sign: He paid $2.50 for a pack of cigarettes... $second equation$ ...and here is more text... $third equation$ ...and here is yet more text... $fourth equation$

should not become:

This is the beginning of a paragraph <equation>first equation$ ...and here is some text that includes a single dollar sign: He paid <equation>2.50 for a pack of cigarettes... $second equation$ ...and here is more text... $third equation$ ...and here is yet more text... $fourth equation</equation>

Note: I am writing in Bash.

Upvotes: 3

Views: 125

Answers (2)

markp-fuso

Reputation: 34856

NOTE: The first part of this answer focuses solely on replacing pairs of $'s; for OP's request not to replace standalone $'s, see the second half of the answer.


Replacing pairs of $'s

Sample data:

$ cat latex.txt
... $first equation$ ... $second equation$ ... $third equation$

One sed idea:

sed -E 's|\$([^$]*)\$|<equation>\1</equation>|g' latex.txt

Where:

  • -E - enable extended regex support
  • \$ - match a literal $
  • ([^$]*) - [capture group #1] - match everything that is not a literal $ (in this case everything between the pair of $'s)
  • \$ - match a literal $
  • <equation>\1</equation> - replace the matched string with <equation> + contents of capture group + </equation>
  • g - apply the search/replace to every match on the line, not just the first

This generates:

... <equation>first equation</equation> ... <equation>second equation</equation> ... <equation>third equation</equation>
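
As a quick check of the breakdown above, here is a minimal sketch that runs the same sed expression on an inline sample line instead of latex.txt. The function name convert_pairs and the sample text are assumptions made for illustration. The second command swaps in a greedy .* purely for contrast, to show the kind of first-to-last match the question describes.

```shell
#!/usr/bin/env bash
# Hypothetical sample line; it stands in for the contents of latex.txt.
line='... $first equation$ ... $second equation$ ... $third equation$'

# convert_pairs is an illustrative name, not part of the original answer.
# [^$]* cannot cross a $, so each match ends at the nearest closing $.
convert_pairs() {
  sed -E 's|\$([^$]*)\$|<equation>\1</equation>|g'
}

# Every pair on the line is converted.
printf '%s\n' "$line" | convert_pairs

# Contrast: a greedy .* runs from the first $ to the last $ on the line,
# so only the outermost delimiters are replaced.
printf '%s\n' "$line" | sed -E 's|\$(.*)\$|<equation>\1</equation>|'
```

The negated character class is what keeps each match short; no non-greedy quantifier is needed, which matters because sed's extended regex syntax does not offer one.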

Dealing with standalone $

If the standalone $ can be escaped (e.g., \$), one idea would be to have sed replace it with a placeholder string that does not occur anywhere else in the text, perform the <equation> / </equation> replacement, then change the placeholder back to \$.

Sample data:

$ cat latex.txt
... $first equation$ ... $second equation$ ... $third equation$
... $first equation$ ... \$3.50 cup of coffee ... $third equation$

Original sed solution with the new replacements:

sed -E 's|\\\$|LITDOL|g;s|\$([^$]*)\$|<equation>\1</equation>|g;s|LITDOL|\\\$|g' latex.txt

Where we replace \$ with LITDOL (LITeral DOLlar), perform our original replacement, then switch LITDOL back to \$.

Which generates:

... <equation>first equation</equation> ... <equation>second equation</equation> ... <equation>third equation</equation>
... <equation>first equation</equation> ... \$3.50 cup of coffee ... <equation>third equation</equation>
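
The placeholder approach silently breaks if the text already contains the placeholder string. The sketch below wraps the same three substitutions in a function that first checks for that collision. The function name convert_with_escapes is hypothetical, and the sketch assumes the input is small enough to hold in a shell variable and that literal dollars are written as \$.

```shell
#!/usr/bin/env bash
# convert_with_escapes is an illustrative name, not part of the original answer.
# Assumption: standalone dollar signs in the input are escaped as \$.
convert_with_escapes() {
  local text
  text=$(cat)   # read all of stdin

  # Refuse to run if the placeholder is already present, because the
  # final substitution would then rewrite text it did not create.
  if printf '%s\n' "$text" | grep -q 'LITDOL'; then
    echo 'input already contains LITDOL; choose another placeholder' >&2
    return 1
  fi

  # 1) hide \$   2) convert $...$ pairs   3) restore \$
  printf '%s\n' "$text" |
    sed -E 's|\\\$|LITDOL|g; s|\$([^$]*)\$|<equation>\1</equation>|g; s|LITDOL|\\\$|g'
}

printf '%s\n' '... $first equation$ ... \$3.50 cup of coffee ... $third equation$' |
  convert_with_escapes
```

For a real file the call would be convert_with_escapes < latex.txt.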

Upvotes: 5

stack0114106

Reputation: 8711

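Try this Perl one-liner, which uses a negative lookahead to skip any $ that is immediately followed by a digit or a period. Note that this assumes no equation begins with a digit or a period; an equation such as $2x$ would not be recognized.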

$ cat joseph.txt
This is the beginning of a paragraph $first equation$ ...and here is some text that includes a single dollar sign: He paid $2.50 for a pack of cigarettes... $second equation$ ...and here is more text... $third equation$ ...and here is yet more text... $fourth equation$
$ perl -p -e 's/(\$)(?![\d.]+)(.+?)(\$)/<equation>$2<\/equation>/g' joseph.txt
This is the beginning of a paragraph <equation>first equation</equation> ...and here is some text that includes a single dollar sign: He paid $2.50 for a pack of cigarettes... <equation>second equation</equation> ...and here is more text... <equation>third equation</equation> ...and here is yet more text... <equation>fourth equation</equation>
$

Upvotes: 3
