Reputation: 63
I have a TXT file with LaTeX math equations where a single $ delimiter is used before and after each inline equation.
I would like to find each of the equations within a paragraph and replace the delimiters with XML opening and closing tags (<equation> and </equation>).
E.g.,
The following paragraph:
This is the beginning of a paragraph $first equation$ ...and here is some text... $second equation$ ...and here is more text... $third equation$ ...and here is yet more text... $fourth equation$
should become:
This is the beginning of a paragraph <equation>first equation</equation> ...and here is some text... <equation>second equation</equation> ...and here is more text... <equation>third equation</equation> ...and here is yet more text... <equation>fourth equation</equation>
I have tried the sed and perl commands such as the following:
perl -p -e 's/(\$)(.*[^\$])(\$)/<equation>$2<\/equation>/'
But these commands result in the first and last instances of equations being converted but none of the equations between these two:
This is the beginning of a paragraph <equation>first equation$ ...and here is some text... $second equation$ ...and here is more text... $third equation$ ...and here is yet more text... $fourth equation</equation>
I would also like a robust solution that accounts for a single $ that is not used as a LaTeX delimiter. E.g., the following paragraph:
This is the beginning of a paragraph $first equation$ ...and here is some text that includes a single dollar sign: He paid $2.50 for a pack of cigarettes... $second equation$ ...and here is more text... $third equation$ ...and here is yet more text... $fourth equation$
should not become:
This is the beginning of a paragraph <equation>first equation$ ...and here is some text that includes a single dollar sign: He paid <equation>2.50 for a pack of cigarettes... $second equation$ ...and here is more text... $third equation$ ...and here is yet more text... $fourth equation</equation>
Note: I am writing in Bash.
Upvotes: 3
Views: 125
Reputation: 34856
NOTE: The first part of this answer focuses solely on replacing pairs of $'s; for OP's request to not replace standalone $'s, see the 2nd half of the answer.
Replacing pairs of $'s
Sample data:
$ cat latex.txt
... $first equation$ ... $second equation$ ... $third equation$
One sed idea:
sed -E 's|\$([^$]*)\$|<equation>\1</equation>|g' latex.txt
Where:
-E - enable extended regex support
\$ - match a literal $
([^$]*) - [capture group #1] - match everything that is not a literal $ (in this case everything between the pair of $'s)
\$ - match a literal $
<equation>\1</equation> - replace the matched string with <equation> + contents of capture group + </equation>
g - repeat the search/replace as often as necessary
This generates:
... <equation>first equation</equation> ... <equation>second equation</equation> ... <equation>third equation</equation>
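To see why the attempt in the question tags only the outermost pair, here is a minimal sketch comparing a greedy .* with the negated class used above. The sample line and the variable names (line, greedy, paired) are made up for illustration.

```shell
#!/usr/bin/env bash
# Illustrative sample line (not from the original data).
line='a $x$ b $y$ c'

# Greedy: .* runs to the LAST '$' on the line, and without the g flag
# only one substitution happens, so just the outermost pair is tagged.
greedy=$(printf '%s\n' "$line" | sed -E 's|\$(.*)\$|<equation>\1</equation>|')

# Negated class: [^$]* cannot cross a '$', so each match ends at the
# nearest closing delimiter; g repeats this for every pair on the line.
paired=$(printf '%s\n' "$line" | sed -E 's|\$([^$]*)\$|<equation>\1</equation>|g')

printf '%s\n' "$greedy" "$paired"
# a <equation>x$ b $y</equation> c
# a <equation>x</equation> b <equation>y</equation> c
```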
Dealing with standalone $
If the standalone $ can be escaped (e.g., \$), one idea would be to have sed replace this with a nonsensical literal, perform the <equation> / </equation> replacement, then change the nonsensical literal back to \$.
Sample data:
$ cat latex.txt
... $first equation$ ... $second equation$ ... $third equation$
... $first equation$ ... \$3.50 cup of coffee ... $third equation$
Original sed solution with the new replacements:
sed -E 's|\\\$|LITDOL|g;s|\$([^$]*)\$|<equation>\1</equation>|g;s|LITDOL|\\\$|g' latex.txt
Where we replace \$ with LITDOL (LITeral DOLlar), perform our original replacement, then switch LITDOL back to \$.
Which generates:
... <equation>first equation</equation> ... <equation>second equation</equation> ... <equation>third equation</equation>
... <equation>first equation</equation> ... \$3.50 cup of coffee ... <equation>third equation</equation>
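The three-step substitution can also be wrapped in a small shell function. This is only a sketch: the function name tag_equations is hypothetical, and it assumes the placeholder LITDOL never occurs in the input (otherwise pick another string). Because the placeholder contains no $, an escaped \$ inside an equation should survive as well.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads text on stdin, writes tagged text to stdout.
# Assumption: the literal string LITDOL never appears in the input.
tag_equations() {
    # 1) hide escaped dollars, 2) tag each $...$ pair, 3) restore escaped dollars
    sed -E \
        -e 's|\\\$|LITDOL|g' \
        -e 's|\$([^$]*)\$|<equation>\1</equation>|g' \
        -e 's|LITDOL|\\\$|g'
}

printf '%s\n' 'Price: \$3.50, formula: $cost = \$5 + x$' | tag_equations
# Price: \$3.50, formula: <equation>cost = \$5 + x</equation>
```

Usage with the sample file would be: tag_equations < latex.txt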
Upvotes: 5
Reputation: 8711
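Try this Perl one-liner. It uses a negative lookahead to skip any $ that is immediately followed by digits or a dot (such as $2.50), and a non-greedy .+? so each match stops at the nearest closing $.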
$ cat joseph.txt
This is the beginning of a paragraph $first equation$ ...and here is some text that includes a single dollar sign: He paid $2.50 for a pack of cigarettes... $second equation$ ...and here is more text... $third equation$ ...and here is yet more text... $fourth equation$
$ perl -p -e 's/(\$)(?![\d.]+)(.+?)(\$)/<equation>$2<\/equation>/g' joseph.txt
This is the beginning of a paragraph <equation>first equation</equation> ...and here is some text that includes a single dollar sign: He paid $2.50 for a pack of cigarettes... <equation>second equation</equation> ...and here is more text... <equation>third equation</equation> ...and here is yet more text... <equation>fourth equation</equation>
$
Upvotes: 3