Manuel ANDIA

Reputation: 43

Replace text in HTML and BBCode sample

First of all, this is my first post on SO, which has been a great help to me for years, so thank you all!

Now onto my question:

Sample:

This is my sample text.
It may contain <a href="http://www.somesite.org/test.htm">HTML tags</a>,
[b]BBCode[b],
or even <a href="http://www.someothersite.com/">[b][u]both[/u] intricated[/b]</a>!

Keyword list:

kw = {'sample': 'http://www.sample.fr', 'BBCode': 'http://www.bbcode.sp'}

As you can see I'm currently using Python because I'm used to the language, but I can be flexible.

My goal is to detect which words in my keyword list are present in the sample text, and to "decorate" each matching word with a link (preferably in BBCode) to the corresponding URL, without altering the rest of the string (as wikis do).

Continuing the example above, I'd like to get:

This is my [url=http://www.sample.fr]sample[/url] text.
It may contain <a href="http://www.somesite.org/test.htm">HTML tags</a>,
[b][url=http://www.bbcode.sp]BBCode[/url][b],
or even <a href="http://www.someothersite.com/">[b][u]both[/u] intricated[/b]</a>!

The main problem here is that sometimes, one of the keywords in my list appears inside a tag, which I do not want to "decorate" with a link for obvious reasons.

In other words, the text I'd like to replace can be located only outside the anchor tags:

**HERE** <not here>[not here] **HERE** [/not here]</not here> **HERE**

Also, I've already tried using BeautifulSoup (along with PostMarkup to convert BBCode to HTML before parsing with BeautifulSoup) but it doesn't allow me to keep the initial string...

Remark: "real" text can never actually appear between brackets (angle or square) because of how my forum is used, which simplifies the problem quite a bit.

Sorry for the very long question; I hope everything is clear!

Any help is appreciated. Thanks to everyone in advance!

Update: Casimir's solution in Python (see below) works just great. Thank you Casimir et Hippolyte!

Upvotes: 1

Views: 865

Answers (1)

Casimir et Hippolyte

Reputation: 89557

The approach is always the same: first match what you want to avoid.

Example:

(?s)     # dotall mode
(      # capture group 1: everything you want to avoid
    <!--.*?--> # html comment
  |
    <[^>]+> # html tag
  |
    \[[^\]]+\] # bbcode
)
|    # OR
kw1|kw2|kw3|...

Then use a function as the replacement. Inside the function, if capture group 1 is defined, return the match unchanged; otherwise return the replacement string for the matched keyword.
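
The idea above could be sketched in Python roughly as follows. This is only a minimal illustration, not a definitive implementation: the names `build_pattern` and `decorate` are made up for this example, the `\b` word boundaries and the longest-first ordering of keywords are assumptions that go beyond the pattern shown, and the `[url=...]` output format is taken from the question.

```python
import re

# Keyword list from the question.
kw = {'sample': 'http://www.sample.fr', 'BBCode': 'http://www.bbcode.sp'}

def build_pattern(keywords):
    """Build 'things to skip (group 1) OR a keyword' (hypothetical helper)."""
    # Assumption: try longer keywords first so they win over shorter prefixes.
    ordered = sorted(keywords, key=len, reverse=True)
    alternation = '|'.join(re.escape(k) for k in ordered)
    return re.compile(
        r'(<!--.*?-->'      # HTML comment
        r'|<[^>]+>'         # HTML tag
        r'|\[[^\]]+\])'     # BBCode tag
        # Assumption: \b keeps keywords from matching inside longer words.
        r'|\b(?:' + alternation + r')\b',
        re.DOTALL,
    )

def decorate(text, keywords):
    """Wrap each keyword found outside tags in a BBCode link (hypothetical helper)."""
    pattern = build_pattern(keywords)

    def repl(match):
        # Group 1 is set when a comment or tag matched: return it untouched.
        if match.group(1) is not None:
            return match.group(0)
        word = match.group(0)
        return '[url={}]{}[/url]'.format(keywords[word], word)

    return pattern.sub(repl, text)

sample_text = (
    'This is my sample text.\n'
    'It may contain <a href="http://www.somesite.org/test.htm">HTML tags</a>,\n'
    '[b]BBCode[b],\n'
    'or even <a href="http://www.someothersite.com/">[b][u]both[/u] intricated[/b]</a>!'
)
print(decorate(sample_text, kw))
```

Because each tag is consumed as a whole match before the keyword alternative is tried at that position, a keyword that appears inside a tag (for example inside an `href` value) is never rewritten. Matching here is case-sensitive, so "Sample" would not be decorated unless it is added to the keyword list.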

Upvotes: 3
