How to replace string multiple times?

Question

I have a 10,000-line source file with tons of duplication, so I read the file in as text.

Example:

    assert PyArray_TYPE(real0) == np.NPY_DOUBLE, "real0 is not double"
    assert real0.ndim == 1, "real0 has wrong dimensions"
    if not (PyArray_FLAGS(real0) & np.NPY_C_CONTIGUOUS):
        real0 = PyArray_GETCONTIGUOUS(real0)
    real0_data = real0.data

I want to replace all occurrences of this pattern with

    real0_data = _get_data(real0, "real0")

where real0 can be any variable name matching [a-z0-9]+.

Don't get confused by the source code. The code itself doesn't matter; this is purely text processing with regular expressions.

This is what I have so far:

    PATH = "func.pyx"
    source_string = open(PATH,"r").read()

    pattern = r"""
    assert PyArray_TYPE\(([a-z0-9]+)\) == np.NPY_DOUBLE, "([a-z0-9]+) is not double"
    assert ([a-z0-9]+).ndim == 1, "([a-z0-9]+) has wrong dimensions"
    if not (PyArray_FLAGS(([a-z0-9]+)) & np.NPY_C_CONTIGUOUS):
       ([a-z0-9]+) = PyArray_GETCONTIGUOUS(([a-z0-9]+))
    ([a-z0-9]+)_data = ([a-z0-9]+).data"""

Michael Geary · Accepted Answer

You can do this in any text editor that supports multiline regular expression search and replace.

I used Komodo IDE to test this, because it includes an excellent regular expression tester ("Rx Toolkit") for experimenting with regular expressions. I think there are also some online tools like this. The same regular expression works in the free Komodo Edit. It should also work in most other editors that support Perl-compatible regular expressions.

In Komodo, I used the Replace dialog with the Regex option checked, to find:

    assert PyArray_TYPE\((\w+)\) == np\.NPY_DOUBLE, "\1 is not double"\s*\n\s*assert \1\.ndim == 1, "\1 has wrong dimensions"\s*\n\s*if not \(PyArray_FLAGS\(\1\) & np\.NPY_C_CONTIGUOUS\):\s*\n\s*\1 = PyArray_GETCONTIGUOUS\(\1\)\s*\n\s*\1_data = \1\.data

and replace it with:

    \1_data = _get_data(\1, "\1")

Given this test code:

    assert PyArray_TYPE(real0) == np.NPY_DOUBLE, "real0 is not double"
    assert real0.ndim == 1, "real0 has wrong dimensions"
    if not (PyArray_FLAGS(real0) & np.NPY_C_CONTIGUOUS):
        real0 = PyArray_GETCONTIGUOUS(real0)
    real0_data = real0.data

    assert PyArray_TYPE(real1) == np.NPY_DOUBLE, "real1 is not double"
    assert real1.ndim == 1, "real1 has wrong dimensions"
    if not (PyArray_FLAGS(real1) & np.NPY_C_CONTIGUOUS):
        real1 = PyArray_GETCONTIGUOUS(real1)
    real1_data = real1.data

    assert PyArray_TYPE(real2) == np.NPY_DOUBLE, "real2 is not double"
    assert real2.ndim == 1, "real2 has wrong dimensions"
    if not (PyArray_FLAGS(real2) & np.NPY_C_CONTIGUOUS):
        real2 = PyArray_GETCONTIGUOUS(real2)
    real2_data = real2.data

The result is:

    real0_data = _get_data(real0, "real0")

    real1_data = _get_data(real1, "real1")

    real2_data = _get_data(real2, "real2")

So how did I get that regular expression from your original code?

1. Prefix all instances of (, ), ., and * with \ to escape them (an easy manual search and replace).
2. Replace the first instance of real0 with (\w+). This matches and captures a string of alphanumeric characters.
3. Replace the remaining instances of real0 with \1. This matches the text captured by (\w+).
4. Replace each newline and the leading space on the next line with \s*\n\s*. This matches any trailing space on the line, plus the newline, plus all leading space on the next line. That way the regular expression works regardless of the nesting level of the code it's matching.

Finally, the "replace" text uses \1 where it needs the original captured text.
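
The derivation steps above can also be sketched programmatically in Python. This is only an illustration of the manual procedure: the names TEMPLATE and build_pattern are made up for this example, and the plain str.replace calls assume the placeholder name (real0) does not occur inside any other identifier in the snippet.

```python
import re

# The duplicated snippet, with "real0" standing in for any variable name.
TEMPLATE = '''assert PyArray_TYPE(real0) == np.NPY_DOUBLE, "real0 is not double"
assert real0.ndim == 1, "real0 has wrong dimensions"
if not (PyArray_FLAGS(real0) & np.NPY_C_CONTIGUOUS):
    real0 = PyArray_GETCONTIGUOUS(real0)
real0_data = real0.data'''


def build_pattern(template, name):
    """Turn a literal snippet into a regex by following the four steps."""
    # Step 1: escape ( ) . * by prefixing each with a backslash.
    pattern = re.sub(r"([().*])", r"\\\1", template)
    # Step 2: the first occurrence of the name becomes a capturing group.
    pattern = pattern.replace(name, r"(\w+)", 1)
    # Step 3: every remaining occurrence becomes a backreference to it.
    pattern = pattern.replace(name, r"\1")
    # Step 4: each line break, with the spaces and tabs around it,
    # becomes \s*\n\s* (a lambda avoids escape processing of the text).
    return re.sub(r"[ \t]*\n[ \t]*", lambda m: r"\s*\n\s*", pattern)


pattern = build_pattern(TEMPLATE, "real0")
print(pattern)
```

The printed pattern should be the same single-line expression used in the Replace dialog, so it can be pasted into an editor or passed to Python's re module.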

You could of course use a similar regular expression in Python if you want to do it that way. I would suggest using \w instead of [a-z0-9] just to make it simpler (note that \w also matches uppercase letters and underscores). Also, rather than embedding literal newlines and leading spaces in a multiline string, use the \s*\n\s* approach shown above. That way the pattern is independent of the nesting level, as mentioned earlier.
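
A minimal sketch of that Python route, using re.sub from the standard library with the same find and replace expressions. The helper names collapse_checks and make_block are hypothetical; make_block exists only to generate test input at two different indentation levels.

```python
import re

# Same expression as in the editor, split across adjacent raw strings.
PATTERN = re.compile(
    r'assert PyArray_TYPE\((\w+)\) == np\.NPY_DOUBLE, "\1 is not double"\s*\n\s*'
    r'assert \1\.ndim == 1, "\1 has wrong dimensions"\s*\n\s*'
    r'if not \(PyArray_FLAGS\(\1\) & np\.NPY_C_CONTIGUOUS\):\s*\n\s*'
    r'\1 = PyArray_GETCONTIGUOUS\(\1\)\s*\n\s*'
    r'\1_data = \1\.data'
)
REPLACEMENT = r'\1_data = _get_data(\1, "\1")'


def collapse_checks(source):
    """Replace every occurrence of the duplicated block in source."""
    return PATTERN.sub(REPLACEMENT, source)


def make_block(name, indent):
    """Build one copy of the duplicated block (test data only)."""
    lines = [
        f'assert PyArray_TYPE({name}) == np.NPY_DOUBLE, "{name} is not double"',
        f'assert {name}.ndim == 1, "{name} has wrong dimensions"',
        f'if not (PyArray_FLAGS({name}) & np.NPY_C_CONTIGUOUS):',
        f'    {name} = PyArray_GETCONTIGUOUS({name})',
        f'{name}_data = {name}.data',
    ]
    return "\n".join(indent + line for line in lines) + "\n"


# Two blocks at different nesting levels, separated by a blank line.
sample = make_block("real0", "    ") + "\n" + make_block("real1", "        ")
result = collapse_checks(sample)
print(result)
```

To process the real file, read it as in the question (open(PATH, "r").read()), pass the text to collapse_checks, and write the result back out. Because the match starts at the word assert, the indentation in front of each block is left in place.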
