I have a 10,000-line source file with tons of duplication, so I read the file in as text.
Example:
assert PyArray_TYPE(real0) == np.NPY_DOUBLE, "real0 is not double"
assert real0.ndim == 1, "real0 has wrong dimensions"
if not (PyArray_FLAGS(real0) & np.NPY_C_CONTIGUOUS):
    real0 = PyArray_GETCONTIGUOUS(real0)
real0_data = <double*>real0.data
I want to replace all occurrences of this pattern with
real0_data = _get_data(real0, "real0")
where real0 can be any variable name [a-z0-9]+
So don't get confused by the source code. The code doesn't matter, this is text processing and regex.
This is what I have so far:
PATH = "func.pyx"
source_string = open(PATH,"r").read()
pattern = r"""
assert PyArray_TYPE\(([a-z0-9]+)\) == np.NPY_DOUBLE, "([a-z0-9]+) is not double"
assert ([a-z0-9]+).ndim == 1, "([a-z0-9]+) has wrong dimensions"
if not (PyArray_FLAGS(([a-z0-9]+)) & np.NPY_C_CONTIGUOUS):
    ([a-z0-9]+) = PyArray_GETCONTIGUOUS(([a-z0-9]+))
([a-z0-9]+)_data = ([a-z0-9]+).data"""
You can do this in any text editor that supports multiline regular expression search and replace.
I used Komodo IDE to test this, because it includes an excellent regular expression tester ("Rx Toolkit") for experimenting with regular expressions. I think there are also some online tools like this. The same regular expression works in the free Komodo Edit. It should also work in most other editors that support Perl-compatible regular expressions.
In Komodo, I used the Replace dialog with the Regex option checked, to find:
assert PyArray_TYPE\((\w+)\) == np\.NPY_DOUBLE, "\1 is not double"\s*\n\s*assert \1\.ndim == 1, "\1 has wrong dimensions"\s*\n\s*if not \(PyArray_FLAGS\(\1\) & np\.NPY_C_CONTIGUOUS\):\s*\n\s*\1 = PyArray_GETCONTIGUOUS\(\1\)\s*\n\s*\1_data = <double\*>\1\.data
and replace it with:
\1_data = _get_data(\1, "\1")
Given this test code:
assert PyArray_TYPE(real0) == np.NPY_DOUBLE, "real0 is not double"
assert real0.ndim == 1, "real0 has wrong dimensions"
if not (PyArray_FLAGS(real0) & np.NPY_C_CONTIGUOUS):
    real0 = PyArray_GETCONTIGUOUS(real0)
real0_data = <double*>real0.data
assert PyArray_TYPE(real1) == np.NPY_DOUBLE, "real1 is not double"
assert real1.ndim == 1, "real1 has wrong dimensions"
if not (PyArray_FLAGS(real1) & np.NPY_C_CONTIGUOUS):
    real1 = PyArray_GETCONTIGUOUS(real1)
real1_data = <double*>real1.data
assert PyArray_TYPE(real2) == np.NPY_DOUBLE, "real2 is not double"
assert real2.ndim == 1, "real2 has wrong dimensions"
if not (PyArray_FLAGS(real2) & np.NPY_C_CONTIGUOUS):
    real2 = PyArray_GETCONTIGUOUS(real2)
real2_data = <double*>real2.data
The result is:
real0_data = _get_data(real0, "real0")
real1_data = _get_data(real1, "real1")
real2_data = _get_data(real2, "real2")
So how did I get that regular expression from your original code?
1. Prefix every (, ), ., and * with \ to escape it (an easy manual search and replace).
2. Replace the first real0 with (\w+). This matches and captures a run of word characters (letters, digits, and underscores).
3. Replace every other real0 with \1. This matches the text captured by (\w+).
4. Replace each line break, together with the indentation that follows it, with \s*\n\s*. This matches any trailing space on the line, plus the newline, plus all leading space on the next line. That way the regular expression works regardless of the nesting level of the code it's matching.
Finally, the "replace" text uses \1 where it needs the original captured text.
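The derivation above can also be done programmatically rather than by hand. The following is only a sketch in Python, not part of the Komodo workflow: the helper name build_pattern is hypothetical, and re.escape stands in for the manual escaping (it escapes more characters than strictly necessary, including spaces, which is harmless).

```python
import re

# One concrete instance of the duplicated block, used as the template.
SAMPLE = '''\
assert PyArray_TYPE(real0) == np.NPY_DOUBLE, "real0 is not double"
assert real0.ndim == 1, "real0 has wrong dimensions"
if not (PyArray_FLAGS(real0) & np.NPY_C_CONTIGUOUS):
    real0 = PyArray_GETCONTIGUOUS(real0)
real0_data = <double*>real0.data
'''


def build_pattern(sample, name):
    """Hypothetical helper: turn one concrete block into a regex."""
    # Escape regex metacharacters, one line at a time, dropping indentation.
    lines = [re.escape(line.strip()) for line in sample.strip().splitlines()]
    # Join the lines so that any trailing or leading whitespace is accepted.
    pattern = r"\s*\n\s*".join(lines)
    # The first occurrence of the variable name becomes a capture group...
    pattern = pattern.replace(name, r"(\w+)", 1)
    # ...and every later occurrence becomes a backreference to it.
    return pattern.replace(name, r"\1")


pattern = build_pattern(SAMPLE, "real0")

# Try the generated pattern on a block that uses a different variable name.
text = SAMPLE.replace("real0", "imag7")
print(re.sub(pattern, r'\g<1>_data = _get_data(\g<1>, "\g<1>")', text))
```

This relies on the variable name being purely alphanumeric, so that re.escape leaves it untouched and the plain string replacement can find it afterwards.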
You could of course use a similar regular expression in Python if you want to do it that way. I would suggest using \w instead of [a-z0-9] just to make it simpler. Also, don't include the newlines and leading spaces in a multiline string; use the \s*\n\s* approach instead. This way it will be independent of the nesting level, as I mentioned above.
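As a rough sketch of what that could look like in Python, here is the same expression passed to re.sub. The function name dedupe and the inline test source are illustrative assumptions; in practice the text would come from reading func.pyx as in the question. The replacement uses \g<1> rather than \1, which is equivalent here but avoids ambiguity if a digit ever follows the group reference.

```python
import re

# The same expression as in the editor search, split across adjacent raw
# string literals for readability (Python concatenates them).
PATTERN = re.compile(
    r'assert PyArray_TYPE\((\w+)\) == np\.NPY_DOUBLE, "\1 is not double"\s*\n\s*'
    r'assert \1\.ndim == 1, "\1 has wrong dimensions"\s*\n\s*'
    r'if not \(PyArray_FLAGS\(\1\) & np\.NPY_C_CONTIGUOUS\):\s*\n\s*'
    r'\1 = PyArray_GETCONTIGUOUS\(\1\)\s*\n\s*'
    r'\1_data = <double\*>\1\.data'
)
REPLACEMENT = r'\g<1>_data = _get_data(\g<1>, "\g<1>")'


def dedupe(source):
    """Replace every matching block with a single _get_data call."""
    return PATTERN.sub(REPLACEMENT, source)


# A nested block: the match starts at "assert", so the indentation
# before the first line is preserved in the output.
source = '''\
def f(real0):
    assert PyArray_TYPE(real0) == np.NPY_DOUBLE, "real0 is not double"
    assert real0.ndim == 1, "real0 has wrong dimensions"
    if not (PyArray_FLAGS(real0) & np.NPY_C_CONTIGUOUS):
        real0 = PyArray_GETCONTIGUOUS(real0)
    real0_data = <double*>real0.data
'''
print(dedupe(source))
```

Note that \w is broader than [a-z0-9]: it also matches uppercase letters and underscores, so names like my_array are covered too. Because the backreferences require the same name on every line, a block that mixes two different variable names is left untouched.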