Reputation: 137
I am trying to remove the following pattern from some data and am getting mixed results.
--endof["somerandomtext"]
Basically the text always starts with --endof["
and ends with "]
and the words in between change.
The line of code I am using does not work all the time.
d = re.sub('--+([a-zA-Z0-9_"-\[]*)+\]', " ", d)
I am new to trying to parse data using re.sub or any method. I have been just guessing at how to try and make this line work, and I probably have something wrong that is causing me problems.
Any help appreciated.
Upvotes: 0
Views: 47
Reputation: 6961
To remove text starting with --endof[" and ending with "], you should match these as exact characters and match a substring in the middle.
Because [ and ] have special meaning in a regular expression, you need to escape them with \ (as correctly stated in a comment, ] doesn't have to be escaped here; I am leaving it escaped for extra clarity).
In this example, the substring in the middle is composed of one or more letters and digits (hence the +). It can be altered as needed.
s = re.sub(r'--endof\["[a-zA-Z0-9]+"\]', "", s)
To break this up further:
--endof matches these characters exactly.
\[ matches the character [.
" matches the character ".
[a-zA-Z0-9]+ matches a string consisting of one or more letters and digits (+ is for "one or more").
" again matches the character ".
\] matches the character ] (and can be specified as ] alone).
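The breakdown above can be tried out with a small sketch. The sample string is made up for illustration (the real data is not shown in the question), and the wider character class in the second call is only one possible way of "altering as needed", assuming the text in the middle may also contain spaces and underscores.

```python
import re

# Hypothetical sample data; the actual input is not shown in the question.
sample = 'first --endof["abc123"] second --endof["some random text"] third'

# Raw string, so the backslashes reach the regex engine unchanged.
strict = r'--endof\["[a-zA-Z0-9]+"\]'
cleaned_strict = re.sub(strict, "", sample)
# Only the first marker is removed: the second one contains spaces,
# which [a-zA-Z0-9]+ does not match.
print(cleaned_strict)

# Assumed variation: also allow spaces and underscores in the middle part.
wider = r'--endof\["[a-zA-Z0-9_ ]+"\]'
cleaned_wider = re.sub(wider, "", sample)
print(cleaned_wider)
```

If the markers in the real data contain other characters (punctuation, for instance), the character class would need to be extended accordingly.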
Upvotes: 2
Reputation: 57033
A variation of @Hexagon's answer:
s = re.sub(r'--endof\[[^]]+]', '', s)
This removes a string that starts with --endof[, followed by one or more characters other than ] (that is what [^]]+ matches), followed by a ]. Works for any text that does not contain closing brackets.
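A short sketch of this variation, including the limitation just mentioned. Both input strings are invented for illustration.

```python
import re

# Raw string; inside [^]] the ] directly after ^ is taken literally.
pattern = r'--endof\[[^]]+]'

# Hypothetical input: the quoted text may contain spaces and punctuation.
plain = 'keep --endof["some random text!"] this'
result_plain = re.sub(pattern, '', plain)
print(result_plain)

# Hypothetical input with a ] inside the quoted text: the match stops at
# the first ], so part of the marker is left behind.
tricky = 'keep --endof["a]b"] this'
result_tricky = re.sub(pattern, '', tricky)
print(result_tricky)
```

The second call shows why the pattern is only suitable when the text between the brackets never contains a closing bracket.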
Upvotes: 1