Roee Adler

Reputation: 33990

Regular Expression for Stripping Strings from Source Code

I'm looking for a regular expression that replaces every string literal in input source code with a constant value such as "string". It also has to handle the escaping convention in which a quote character inside a string is written as two consecutive quote characters (e.g. "he said ""hello""").

To clarify, I will provide some examples of input and expected output:

input: print("hello world, how are you?")
output: print("string")

input: print("hello" + "world")
output: print("string" + "string")

# here's the tricky part:
input: print("He told her ""how you doin?"", and she said ""I'm fine, thanks""")
output: print("string")

I'm working in Python, but I guess the problem is language-agnostic.

EDIT: According to one of the answers, this requirement may not be a good fit for a regular expression. I'm not sure that's true, but I'm not an expert. Phrased in words, what I'm looking for is runs of characters enclosed in double quotes, where even-length groups of adjacent double quotes inside the run are disregarded (they are escaped quotes, not delimiters). That sounds to me like something a DFA can recognize.

Thanks.

Upvotes: 0

Views: 435

Answers (3)

Carl Meyer

Reputation: 126591

If you're parsing Python code, save yourself the hassle and let the standard library's parser module do the heavy lifting.
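A minimal sketch of the standard-library route, assuming the input really is Python source. It uses the tokenize module rather than the parser module named above, since parser was removed in Python 3.10. The function name strip_python_strings and the placeholder argument are made up for illustration.

```python
import io
import tokenize


def strip_python_strings(source, placeholder='"string"'):
    """Replace every STRING token in Python source with a placeholder."""
    result = []
    readline = io.StringIO(source).readline
    for tok in tokenize.generate_tokens(readline):
        if tok.type == tokenize.STRING:
            # Swap only the token text; the original start/end positions
            # are kept so untokenize can rebuild the spacing between tokens.
            tok = tok._replace(string=placeholder)
        result.append(tok)
    return tokenize.untokenize(result)


print(strip_python_strings('print("hello" + "world")\n'))
```

Note that Python itself does not use doubled quotes as an escape: it reads "a ""b"" c" as several adjacent literals, so this approach only fits genuine Python input, not the doubled-quote convention from the question. On Python 3.12 and later, f-strings are also split into separate token types that this sketch does not touch.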

If you're writing your own parser for some custom language, it's awfully tempting to start out by just hacking together a bunch of regexes, but don't do it. You'll dig yourself into an unmaintainable mess. Read up on parsing techniques and do it right (Wikipedia can help).

This regex does the trick for all three of your examples:

re.sub(r'"(?:""|[^"])+"', '"string"', original)
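As a quick check, here is the same pattern run against the three inputs from the question. The wrapper name strip_strings is only for illustration.

```python
import re

# Same pattern as above: an opening quote, then one or more of either a
# doubled quote ("") or any single non-quote character, then a closing quote.
STRING_LITERAL = re.compile(r'"(?:""|[^"])+"')


def strip_strings(source):
    """Replace each doubled-quote-escaped string literal with "string"."""
    return STRING_LITERAL.sub('"string"', source)


examples = [
    'print("hello world, how are you?")',
    'print("hello" + "world")',
    'print("He told her ""how you doin?"", and she said ""I\'m fine, thanks""")',
]
for example in examples:
    print(strip_strings(example))
```

One caveat: because of the +, an empty literal such as print("") is left unchanged. Replacing + with * should make the pattern cover that case as well, if empty strings matter for your input.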

Upvotes: 3

PAG

Reputation: 1946

There's a very good string-matching regular expression over at ActiveState. If it doesn't work out of the box for your last example, it should be fairly trivial to wrap it in a repetition so that adjacent quoted strings are grouped together.

Upvotes: 0

Douglas Leeder

Reputation: 53310

Maybe:

re.sub(r"[^\"]\"[^\"].*[^\"]\"[^\"]",'"string"',input)

EDIT:

No, that won't work for the final example.

I don't think your requirements are regular: they can't be matched by a regular expression. This is because at the heart of the matter, you need to match any odd number of " grouped together, as that is your delimiter.

I think you'll have to do it manually, counting "s.
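A minimal sketch of that manual approach, assuming the doubled-quote escaping described in the question. The function name strip_strings and the placeholder argument are made up for illustration. It needs only two states (inside or outside a string) plus one character of lookahead, which is in line with the DFA intuition from the question.

```python
def strip_strings(source, placeholder='"string"'):
    """Replace each double-quoted literal with a placeholder.

    Inside a literal, a doubled quote ("") is an escaped quote and does
    not end the literal.
    """
    out = []
    i = 0
    in_string = False
    while i < len(source):
        ch = source[i]
        if not in_string:
            if ch == '"':
                # Opening quote: emit the placeholder once, skip the body.
                in_string = True
                out.append(placeholder)
            else:
                out.append(ch)
            i += 1
        elif ch == '"':
            if i + 1 < len(source) and source[i + 1] == '"':
                i += 2  # escaped quote, still inside the literal
            else:
                in_string = False  # closing quote
                i += 1
        else:
            i += 1  # ordinary character inside the literal, dropped
    return "".join(out)


print(strip_strings('print("He told her ""how you doin?"", and she said ""ok""")'))
```

An unterminated literal swallows the rest of the input in this sketch; whether that should raise an error instead depends on the language being processed.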

Upvotes: 0
