Reputation: 31
I'm trying to find a way to strip BBCode from a string. The modules that I've found (BBCode and Post Markup) seem to only translate them to HTML rather than just remove the BBCode and return a clean string. If I'm missing something and one of those actually does what I'm asking I'd love some direction on it :)
Otherwise, is there any way to strip BBCode from a string and return plain text?
Upvotes: 1
Views: 1684
Reputation: 102862
Your answer is actually within the bbcode module. Unfortunately, the relevant method is not in the documentation, but if you search through the code it's there: Parser.strip(). For example:
import bbcode
parser = bbcode.Parser()
code = "[code]a = [1, 2, 3, 4, 5][/code]"
plain_txt = parser.strip(code)
print(plain_txt)
# Output: a = [1, 2, 3, 4, 5]
Unfortunately, both Robᵩ's regex-based answer and postmarkup suffer from the same problem: the inability to differentiate between actual BBCode ([list][*]Item 1[*]Item 2[/list], [color=red]I hate color-blind people![/color], etc.) and bracketed text that is not BBCode, such as the embedded code example I used above (for which they both return a = ), or a line like
I'm feeling sad :[ But, eating ice cream cheers me up! :]
which simply returns
I'm feeling sad :
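To make the failure concrete, here is a small sketch of the regex-only approach applied to those two inputs. The helper name naive_strip is mine, for illustration only; the pattern is the one from the regex answer.

```python
import re

def naive_strip(text):
    # Remove anything that looks like [ ... ], whether or not it is a real tag.
    return re.sub(r"\[.*?\]", "", text)

code_example = "[code]a = [1, 2, 3, 4, 5][/code]"
smiley_example = "I'm feeling sad :[ But, eating ice cream cheers me up! :]"

print(repr(naive_strip(code_example)))    # 'a = '
print(repr(naive_strip(smiley_example)))  # "I'm feeling sad :"
```

In both cases the bracketed text that was not BBCode is thrown away along with the real tags.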
The bbcode module handles these cases correctly because it tokenizes the string first, searching for valid BBCode tokens and treating everything else as plain text. Parser.strip() then just throws the BBCode tokens away and reassembles the text, while the formatting methods turn those tokens into XHTML markup and splice in the rest where appropriate.
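If you want to see the idea without the library, the following is a rough sketch of a tokenize-then-discard stripper. It is not the bbcode module's actual implementation: the tag whitelist (KNOWN_TAGS), the set of verbatim tags (VERBATIM_TAGS), and the function name strip_bbcode are all assumptions I made up for the example.

```python
import re

# Hypothetical whitelist; a real parser would register its own set of tags.
KNOWN_TAGS = {"b", "i", "u", "code", "color", "list", "*", "url", "quote"}
# Tags whose contents are kept verbatim, with no tag recognition inside.
VERBATIM_TAGS = {"code"}

# Candidate token: [name], [/name] or [name=value]
TOKEN_RE = re.compile(r"\[(/?)([a-zA-Z*]+)(?:=[^\]]*)?\]")

def strip_bbcode(text):
    out = []
    pos = 0
    verbatim = None  # name of the verbatim tag we are currently inside
    for m in TOKEN_RE.finditer(text):
        closing = m.group(1) == "/"
        name = m.group(2).lower()
        if verbatim is not None:
            # Inside e.g. [code]: only its own closing tag counts as a token.
            is_tag = closing and name == verbatim
        else:
            is_tag = name in KNOWN_TAGS
        if not is_tag:
            continue  # not a token; it stays part of the surrounding text
        out.append(text[pos:m.start()])  # keep the text before the token
        pos = m.end()                    # drop the token itself
        if verbatim is not None:
            verbatim = None
        elif not closing and name in VERBATIM_TAGS:
            verbatim = name
    out.append(text[pos:])
    return "".join(out)

print(strip_bbcode("[code]a = [1, 2, 3, 4, 5][/code]"))
print(strip_bbcode("I'm feeling sad :[ But, eating ice cream cheers me up! :]"))
print(strip_bbcode("[list][*]Item 1[*]Item 2[/list]"))
```

Only bracketed runs that name a known tag are treated as tokens, so the list literal and the smileys survive. For anything beyond a quick experiment, Parser.strip() is the safer choice.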
Upvotes: 3
Reputation: 168636
Depending upon your needs, this might be sufficient:
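# UNTESTED
import re
with open("some_input_file.txt") as input_file:
    for s in input_file:
        # Remove every [...] run, shortest match first
        s = re.sub(r'\[.*?\]', '', s)
        # Each line already ends with a newline
        print(s, end='')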
Upvotes: 0