XenPanda

Reputation: 137

Matching pattern using reg ex and re.sub

I am trying to remove the following pattern from some data and am getting mixed results.

--endof["somerandomtext"]

Basically the text always starts with --endof[" and ends with "] and the words in between change.
The line of code I am currently using does not work all the time.

d = re.sub('--+([a-zA-Z0-9_"-\[]*)+\]', " ", d)

I am new to trying to parse data using re.sub or any method. I have been just guessing at how to try and make this line work, and I probably have something wrong that is causing me problems.

Any help appreciated.

Upvotes: 0

Views: 47

Answers (2)

Hexagon

Reputation: 6961

To remove text starting with --endof[" and ending with "], you should match these as exact characters, and match a substring in the middle.

Because [ and ] have special meaning in a regular expression, you need to escape them with \ (as correctly stated in a comment, ] doesn't have to be escaped here; I'm leaving it escaped for extra clarity).

In this example, the substring in the middle is composed of one or more letters and digits (hence the +). It can be altered as needed.

d = re.sub(r'--endof\["[a-zA-Z0-9]+"\]', "", d)

To break this down further:

--endof matches these characters exactly.
\[ matches the character [.
" matches the character ".
[a-zA-Z0-9]+ matches a string consisting of one or more letters and digits (+ is for "one or more").
" again matches the character ".
\] matches the character ] (and can be specified as ] alone).
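
As a quick check of the pattern above, here is a minimal sketch. The sample strings are made up for illustration; only the --endof["..."] shape comes from the question. It also shows why the character class may need to be widened: a marker containing an underscore is left untouched.

```python
import re

# Raw string, so the backslashes reach the regex engine unchanged.
pattern = r'--endof\["[a-zA-Z0-9]+"\]'

# Hypothetical sample data (not from the question).
d = 'first row --endof["abc123"] second row --endof["Xyz"] end'
cleaned = re.sub(pattern, "", d)
print(repr(cleaned))

# The class [a-zA-Z0-9] has no underscore, so this marker does not match
# and the string comes back unchanged.
with_underscore = 'x --endof["some_text"] y'
unchanged = re.sub(pattern, "", with_underscore)
print(repr(unchanged))
```

If the text between the quotes can contain other characters (underscores, spaces, dashes), add them to the class, for example [a-zA-Z0-9_ -]+.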

Upvotes: 2

DYZ

Reputation: 57033

A variation of @Hexagon's answer:

s = re.sub(r'--endof\[[^]]+]', '', s)

This removes a string that starts with --endof[, followed by one or more non-] characters ([^]]+), followed by a ]. It works for any text that does not contain closing brackets.
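
A small sketch of that behaviour, including the closing-bracket limitation. The input strings are invented for illustration and are not from the question.

```python
import re

# Raw string; [^]]+ means "one or more characters that are not ]".
pattern = r'--endof\[[^]]+]'

# Hypothetical input: spaces, underscores and dashes between the quotes are fine.
s = 'a --endof["some random_text-1"] b'
result = re.sub(pattern, '', s)
print(repr(result))

# Limitation: a ] inside the quoted text ends the match early,
# so part of the marker is left behind.
tricky = 'x --endof["a]b"] y'
leftover = re.sub(pattern, '', tricky)
print(repr(leftover))
```

If the quoted text might itself contain ], matching up to the closing "] instead, for example with r'--endof\[".*?"\]', is one possible alternative; whether that fits depends on the actual data.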

Upvotes: 1
