Reputation: 33
I am trying to extract all the LaTeX commands from a .tex file, and I have to use Python for this. I tried to collect the commands in a list using the re module.
The problem is that this list does not contain the LaTeX commands whose names include special characters (such as \alpha*, \a', \#, \$, +, :, \; etc.). It only contains the commands that consist of letters.
I am presently using re.match:
# I already know the starting index of '\', which is at self.i.
# The example LaTeX code string could be:
# \documentclass[envcountsame,envcountchap]{svmono}
match_text = re.match(r"\w+", search_string[self.i + 1:])
I am able to extract 'documentclass'. But suppose there is another command like:
"\abstract*[alpha]{beta}"
"\${This is a latex document}"
"\:"
How do I extract only 'abstract*', '$', ':' from these strings?
I am new to Python and tried various approaches, but am not able to extract all these command names. If there is a general python Regex that can handle all these cases, it would be useful.
NOTE: The book 'The Not So Short Introduction to LaTeX' says that LaTeX commands come in three formats:
They start with a backslash \ and then have a name consisting of letters only. Command names are terminated by a space, a number or any other ‘non-letter.’
They consist of a backslash and exactly one non-letter.
Many commands exist in a ‘starred variant’ where a star is appended to the command name.
Upvotes: 3
Views: 1498
Reputation: 12698
LaTeX is a TeX macro package, and as such, everything that applies to TeX also applies to LaTeX.
The question you ask is a difficult one, as TeX is not a regular language. If you only want to deal with commands, you can check for the regex \\([A-Za-z]+ *|.|\n) (see demo), with the caveat that TeX has active characters, that is, characters whose mere presence acts like a command. If you want to deal with command parameters, you'll have to check the individual command definitions, because TeX uses prefix (Polish) notation: the command comes first, followed by a variable number of positional parameters. For parameter extraction, TeX uses brace matching, which is context-free and not regular, so you'll need a complete parser for that.
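As a rough illustration of the regex above in Python, here is a minimal sketch. The function name command_names and the sample string are made up for this example, and active characters are not handled at all.

```python
import re

# Pattern from the answer: a backslash followed by either a run of letters
# (plus any trailing spaces, which TeX skips) or exactly one other character.
# The newline alternative is listed explicitly because "." does not match it.
COMMAND_RE = re.compile(r"\\([A-Za-z]+ *|.|\n)")

def command_names(tex_source):
    """Return the command names in tex_source, without the leading backslash."""
    names = []
    for m in COMMAND_RE.finditer(tex_source):
        raw = m.group(1)
        # Drop the spaces swallowed after a letter command, but keep a lone
        # space as-is: "\ " (control space) is itself a command.
        names.append(raw.rstrip(" ") or raw)
    return names

sample = r"\documentclass[envcountsame]{svmono} \$ 5 \: \alpha  x"
print(command_names(sample))  # ['documentclass', '$', ':', 'alpha']
```

Because finditer scans left to right without overlapping, a double backslash is consumed as one command and the letters after it are not mistaken for a command name.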
TeX allows you to redefine all character classes (category codes), so you can redefine the digits to act as letters and be usable in command names (so, for example, \a23 becomes a valid command name). This happens inside package definitions, where @ is used as a letter in order to define commands that are inaccessible to users but available inside the package.
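A regex cannot follow such redefinitions on its own, but if you know in advance which extra characters count as letters in a given file, you can build the pattern accordingly. The sketch below assumes exactly that; make_command_re and its extra_letters parameter are hypothetical names for this example, and the trailing-space part of the pattern is left out so the captured names stay clean.

```python
import re

def make_command_re(extra_letters=""):
    """Build a command-name pattern for a chosen set of 'letter' characters.

    extra_letters (hypothetical parameter) lists characters to treat as
    letters in addition to A-Z and a-z, e.g. "@" for package code.
    """
    # re.escape keeps characters such as "]" or "-" literal inside the class.
    letters = "A-Za-z" + re.escape(extra_letters)
    return re.compile(r"\\([" + letters + r"]+|.|\n)")

source = r"\@ifnextchar[ \makeatletter \a23"

# Default rules: "@" and digits are not letters.
print(make_command_re().findall(source))
# Package-style rules: "@" counts as a letter.
print(make_command_re("@").findall(source))
# Digits redefined as letters.
print(make_command_re("0123456789").findall(source))
```

With the default rules the first command is read as \@ and \a23 as \a; with "@" as a letter the first command becomes \@ifnextchar; with digits as letters the last one becomes \a23.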
Eliminating LaTeX markup is difficult for this reason, and you can only achieve partial results. There are many different problems to be solved (what to do with \include directives, what to do with valid text in parameters such as those of \chapter or \footnote, whether you want the index included, etc.).
Also, you have to be careful: if you try to eliminate command parameters, you'll also be eliminating part of your text (for example, the text in \footnote, \abstract, \title, \chapter{...}, etc.). I don't know the effect you actually want to get, so I cannot give you more info in this respect.
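If you do want to keep the text inside a parameter such as \footnote{...}, the matching closing brace has to be found by counting, since groups can nest. Here is a minimal sketch of that idea. The helper read_braced_group is a made-up name, and it deliberately ignores comments, verbatim material and redefined character classes, so it is not a complete parser.

```python
def read_braced_group(text, start):
    """Return (content, end) for the balanced {...} group opening at text[start].

    end is the index just past the closing brace. Backslash-escaped
    characters, including escaped braces, are skipped and not counted.
    Raises ValueError if there is no balanced group at that position.
    """
    if start >= len(text) or text[start] != "{":
        raise ValueError("expected '{' at index %d" % start)
    depth = 0
    i = start
    while i < len(text):
        ch = text[i]
        if ch == "\\":
            i += 2  # skip the backslash and the character it escapes
            continue
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return text[start + 1:i], i + 1
        i += 1
    raise ValueError("unbalanced braces")

text = r"\footnote{see \textbf{this} and \{that\}} tail"
content, end = read_braced_group(text, text.index("{"))
print(content)    # see \textbf{this} and \{that\}
print(text[end:])  # " tail" (with the leading space)
```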
Upvotes: 0
Reputation: 51400
Here's the exact translation of your format specification:
\\(?:[^a-zA-Z]|[a-zA-Z]+)\*?
[^a-zA-Z] matches exactly one non-letter
[a-zA-Z]+ matches a name consisting of letters only
\*? matches the optional star
If your format description is accurate, this should do it. Unfortunately I don't know LaTeX so I'm not sure it's 100% OK.
From the feedback in the comments, it turns out the star is applicable only to letter commands, and there can be some other terminating characters as well. The final regex is:
\\(?:[^a-zA-Z]|[a-zA-Z]+[*=']?)
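In Python, this could be used roughly as follows. It is a sketch that assumes, as in the question, that you already know the index of the backslash; command_name_at is a made-up helper name.

```python
import re

# Final pattern from above: a backslash, then either one non-letter, or a run
# of letters optionally followed by one of the characters * = '
COMMAND_RE = re.compile(r"\\(?:[^a-zA-Z]|[a-zA-Z]+[*=']?)")

def command_name_at(text, i):
    """Return the command name whose backslash is at text[i], or None."""
    m = COMMAND_RE.match(text, i)  # match() anchors the pattern at index i
    return m.group(0)[1:] if m else None

examples = [
    r"\documentclass[envcountsame,envcountchap]{svmono}",
    r"\abstract*[alpha]{beta}",
    r"\${This is a latex document}",
    r"\:",
]
for line in examples:
    print(command_name_at(line, 0))
# documentclass
# abstract*
# $
# :
```

Passing the index to match() avoids slicing the string at self.i, and COMMAND_RE.findall(text) returns every command in one pass, backslash included.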
Upvotes: 3