Cleanup old database full of html tags

Question

I'm moving my client's old MySQL database to a new WordPress system (the old one was also WordPress), and I've noticed his articles are all saved with tons and tons of HTML tags full of different stylings, due to importing content directly from MS Word. I've already convinced the client to use Paste From Word and to clean up his articles before saving new ones.

Now, is there any safe way to remove all of the already saved tags without leaving trash behind, and hopefully keeping the original line breaks?

I've started researching regex, but lots of answers here advise against using it to parse HTML. Any clues?

Brandt Solovij · Accepted Answer

Here is a safe process I use for "pre-render cleanup" in a similar DB situation (HTML being stored). It is unfortunately written in Java, but the concept (and the regex used) can be applied to a SQL update query.

One note: I'd recommend not only backing up before doing this, but also testing on a "safe" copy of the DB. Of course, for any update procedure of this size, you likely already know the risks.

One more note: "THE BLOCK OF HTML TO CLEAN" should not be read as a string literal. It is just a placeholder meaning that displayContent is the variable holding the HTML returned from the DB, in this case a single row of the result set inside a loop.

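import java.util.regex.Matcher;
import java.util.regex.Pattern;

// displayContent holds the HTML from one row of the result set.
String displayContent = "THE BLOCK OF HTML TO CLEAN";
// Opening tag plus any attributes. "span" is an assumed example tag name; substitute the tag you need to strip.
String tagregex = "<span[^>]*>";
Pattern p2 = Pattern.compile(tagregex);
Matcher m2 = p2.matcher(displayContent);
displayContent = m2.replaceAll("");
// Remove the matching closing tags.
displayContent = displayContent.replaceAll("</span>", "");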

You can of course use this for any other HTML tags and their attributes. Good luck!
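
To reuse the same idea for other tags, the regex can be wrapped in a small helper that takes the tag name as a parameter. This is only a minimal sketch: the TagStripper class, the stripTag method, and the sample row are hypothetical names made up for illustration, and "span" is just an example tag. It assumes attribute values never contain a literal ">"; if they can, a real HTML parser is the safer choice.

```java
import java.util.regex.Pattern;

final class TagStripper {

    // Removes opening and closing tags named tagName (with any attributes)
    // and keeps the text between them. Matching is case-insensitive.
    static String stripTag(String html, String tagName) {
        // "</?"   : "<" with an optional "/" so closing tags match too
        // "\\b"   : word boundary, so "span" does not match "<spanner>"
        // "[^>]*" : any attributes, up to the closing ">"
        String regex = "</?" + Pattern.quote(tagName) + "\\b[^>]*>";
        return Pattern.compile(regex, Pattern.CASE_INSENSITIVE)
                .matcher(html)
                .replaceAll("");
    }

    public static void main(String[] args) {
        // Hypothetical sample standing in for one row of stored HTML.
        String row = "<p><span style=\"font-family:Arial\">Hello</span> world</p>\n"
                + "<p>Second line</p>";
        System.out.println(stripTag(row, "span"));
    }
}
```

Because only the named tag is removed, everything else in the row is left as it was, including newline characters and any p or br tags, so the original line breaks survive the cleanup.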
