vinaya

Reputation: 272

Does JavaScript consider everything enclosed in <> as HTML tags?

I am tasked with converting hundreds of pages of Word documents into a knowledge base HTML application. This means copying and pasting the HTML of each Word document into an editor like Notepad++ and cleaning it up. (Since these are internal documents, I cannot use online converters.)

I have been able to do most of what I need with a JavaScript function that runs "onload" of the body tag. I then copy the resulting HTML into my application framework.

Here is part of the function I wrote: (it shows only code for removing attributes of div and p tags but works for all html tags in the document)

    function removeatts() //this function will remove all attributes from all elements and also remove empty span elements
    {
        //for removing div tag attributes
        var divs=document.getElementsByTagName('div'); //look at all div tags
        var divnum=divs.length; //number of div tags on the page

        for (var i=0; i<divnum; i++) //run through all the div tags
        {//remove attributes for each div tag
            divs[i].removeAttribute("class");
            divs[i].removeAttribute("id");
            divs[i].removeAttribute("name");
            divs[i].removeAttribute("style");
            divs[i].removeAttribute("lang");
        }

        //for removing p tag attributes
        var ps=document.getElementsByTagName('p'); //look at all p tags
        var pnum=ps.length; //number of p tags on the page

        for (var i=0; i<pnum; i++) //run through all the p tags
        {//remove attributes for each p tag
            var para=ps[i].innerHTML;
            if (para.length!==0) //ie if there is content inside the p tag
            {
                ps[i].removeAttribute("class");
                ps[i].removeAttribute("id");
                ps[i].removeAttribute("name");
                ps[i].removeAttribute("style");
                ps[i].removeAttribute("lang");
            }
            else
            {//remove empty p tag
                ps[i].remove();
            }

            if (para=="<o:p></o:p>" || para=="<o:p> </o:p>" || para=="<o:p>  </o:p>")
            {
                ps[i].remove();
            }
        }

The first problem I encountered is that if I included the if (para=="<o:p></o:p>" || para=="<o:p> </o:p>" || para=="<o:p> </o:p>") part in an else if statement, the whole function stopped executing.

However, without the if (para=="<o:p></o:p>" || para=="<o:p> </o:p>" || para=="<o:p> </o:p>") part, the function does exactly what it is supposed to.

If, however, I keep it the way it is right now, it does some of what I want it to do.

The trouble occurs over some of the Word generated html that looks like this:

      <p class=MsoNormal style='mso-margin-top-alt:auto;mso-margin-bottom-alt:auto; margin-
    left:.25in;text-align:justify;text-indent:-.25in;line-height:150%;
    mso-list:l0 level1 lfo1;tab-stops:list .75in'>
    <![if !supportLists]><span style='font-family:Symbol;mso-fareast-font-family:Symbol;mso-bidi-font-family:Symbol;color:black'><span style='mso-list:Ignore'>·
    <span style='font:7.0pt "Times New Roman"'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
    </span></span></span>
    <![endif]><span style='font-family:"Arial","sans-serif";mso-fareast-font-family:Calibri;color:black'>
    SOME TEXT.<span style='mso-spacerun:yes'>  </span>SOME MORE TEXT.<span style='mso-spacerun:yes'>  </span>EVEN MORE TEXT.
    <span style='mso-spacerun:yes'>  </span>BLAH BLAH BLAH.<o:p></o:p></span></p>
    <p><o:p></o:p></p>

Notice the <o:p></o:p> in the last two lines..... This is not getting removed either when treated as plain text or if I write code for it in the function just like the divs and paragraphs as shown in the function above. When I run the function on this, I get

    <p>
    <![if !supportLists]><span>·
    <span>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
    </span></span></span>
    <![endif]><span>
    SOME TEXT.<span>  </span>SOME MORE TEXT.<span>  </span>EVEN MORE TEXT.
    <span>  </span>BLAH BLAH BLAH.<o:p></o:p></span></p>
    <p><o:p></o:p></p>

I have looked around but cannot find any information about whether JavaScript treats known HTML tags the same way as something like this, which follows the principle of opening and closing tags but doesn't match any known HTML tag.

Any ideas about a workaround would be greatly appreciated!

Upvotes: 0

Views: 327

Answers (3)

jfriend00

Reputation: 707706

JavaScript has no special processing for HTML tags inside JavaScript strings. It doesn't know anything about the HTML in the string.

More likely, your issue is comparing the .innerHTML of a tag to a predetermined string. You cannot and should not do that, because there is no guarantee about the format of .innerHTML. The same HTML can be formatted in hundreds of ways, and some browsers don't remember the original HTML but reconstitute it when you ask for .innerHTML, so that type of string comparison simply isn't reliable.

To be sure of your comparison, you will have to actually parse the HTML (at least with some sort of crude parser, which could perhaps even be a regex) to see whether it matches what you want, because a direct string comparison can't account for optional spacing or optional capitalization.
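As a minimal sketch of the "crude parser" idea, a single case-insensitive regex could stand in for the three exact-string comparisons in the question. The function name isEmptyOfficeParagraph is made up for illustration, and treating &nbsp; as whitespace is an assumption about what should count as empty:

```javascript
// Hypothetical helper: returns true when the markup holds nothing but
// whitespace, &nbsp; entities, and empty <o:p> elements, regardless of
// letter case or spacing.
function isEmptyOfficeParagraph(html) {
    var pattern = /^(?:\s|&nbsp;)*(?:<o:p\s*>(?:\s|&nbsp;)*<\/o:p\s*>(?:\s|&nbsp;)*)*$/i;
    return pattern.test(html);
}
```

Inside the loop from the question, this would be called as isEmptyOfficeParagraph(ps[i].innerHTML). It also returns true for an empty string, so it covers the "empty p tag" case as well.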

Or, perhaps even better, since your HTML is already parsed, why not just look at the actual HTML objects themselves and see whether you have what you want there? You shouldn't even have to remove all those attributes then.
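One way to look at the parsed objects rather than the markup is to walk a paragraph's child nodes. The sketch below is an illustration only: isEffectivelyEmpty is a made-up name, and it assumes the browser exposes Word's tag as an element whose node name is o:p in some letter case.

```javascript
// Hypothetical helper: inspects child nodes instead of comparing markup.
// It only relies on nodeType, nodeName, nodeValue and childNodes, which
// real DOM nodes provide.
function isEffectivelyEmpty(node) {
    for (var i = 0; i < node.childNodes.length; i++) {
        var child = node.childNodes[i];
        if (child.nodeType === 3) {
            // Text node: any non-whitespace character means real content.
            if (/\S/.test(child.nodeValue)) return false;
        } else if (child.nodeType === 1) {
            // Element node: only (effectively empty) o:p elements are tolerated.
            if (child.nodeName.toLowerCase() !== 'o:p') return false;
            if (!isEffectivelyEmpty(child)) return false;
        }
        // Comments and other node types are ignored.
    }
    return true;
}
```

In the page, a paragraph element such as ps[i] would be passed in directly, and no assumption about how .innerHTML is serialized is needed.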

Upvotes: 1

user123444555621

Reputation: 153154

> The first problem I encountered is that if I included the if (para=="<o:p></o:p>" || para=="<o:p> </o:p>" || para=="<o:p> </o:p>") part in an else if statement, the whole function stopped executing.

This is because an else if clause cannot come after else. That is a syntax error, so the script fails to parse and the function never runs.
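For illustration, a chain that parses puts every else if before the single, final else. The function name classifyParagraph and its return values are made up for this sketch and are not part of the original function:

```javascript
// Hypothetical classifier: the else-if branch precedes the final else.
function classifyParagraph(html) {
    if (html.length === 0) {
        return 'remove';            // empty paragraph
    } else if (html === '<o:p></o:p>') {
        return 'remove';            // paragraph holding only an empty o:p
    } else {
        return 'strip-attributes';  // paragraph with real content
    }
}
```

Swapping the last two branches, so that else if follows else, makes the parser reject the whole script before any of it executes.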

> Notice the <o:p></o:p> in the last two lines..... This is not getting removed

I cannot confirm that. When I run your function it removes the <o:p> inside the <p>, as it is supposed to. The <o:p> within the <span> is not processed, because your function does not do that.

If you want to remove all <o:p>s, try:

    [].forEach.call(document.querySelectorAll('o\\:p'), function (el) {
        el.remove();
    });

After that, you may want to remove empty <p>s like this:

    [].forEach.call(document.querySelectorAll('p'), function (el) {
        if (!el.childNodes.length) {
            el.remove();
        }
    });

Upvotes: 0

loxxy

Reputation: 13151

It's not Javascript that is unhappy with the unknown tags. It's the browser.

For JS it's simply a string. So, if it's a very specific case where you just don't need <o:p>, you could remove it with a regex:

    para = para.replace(/<\/?o:p>/ig, "");
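One detail worth keeping in mind with the regex approach: replace returns a new string and leaves the original untouched, so the result has to be stored somewhere. A small sketch, with stripOfficeTags as a made-up name:

```javascript
// Hypothetical helper: strips opening and closing <o:p> tags from a string.
function stripOfficeTags(html) {
    return html.replace(/<\/?o:p>/ig, '');
}

var para = 'BLAH BLAH BLAH.<o:p></o:p>';
var cleaned = stripOfficeTags(para);
// para still holds the original markup; cleaned holds the stripped copy.
```

In the page, the cleaned string would then need to be written back, for example by assigning it to the element's innerHTML.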

But if there are many more such tags, I would strongly suggest getting familiar with XSLT transformations.

Upvotes: 0
