Vishnukk

Reputation: 564

Unwanted characters in the HTML beautified text

My original web-scraped HTML text looks like this:

> {"overview":"\\u003cp\\u003e\\u003cspan style=\\"font-size:
> 10.5pt;\\"\\u003e\\u003cspan class=\\"TextRun SCXW87260372 BCX0\\" style=\\"margin: 0px; padding: 0px; -webkit-user-drag: none;
> -webkit-tap-highlight-color: transparent; color: #000000; font-family: \'Meiryo UI\', \'Meiryo UI_MSFontService\', sans-serif; font-kerning:
> none; line-height: 15.1083px; font-variant-ligatures: none
> !important;\\"\\u003e\\u003cspan class=\\"NormalTextRun SCXW87260372
> BCX0\\" style=\\"margin: 0px; padding: 0px; -webkit-user-drag: none;
> -webkit-tap-highlight-color: transparent; background-color: inherit;\\"\\u003eFioriアプリの動作確認で、2通りのトラブルシューティングをする\\u003c/span\\u003e\\u003c/span\\u003e\\u003cspan
> class=\\"EOP SCXW87260372 BCX0\\" style=\\"margin: 0px;.....

I used BeautifulSoup to strip all the HTML tags with the code below:

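import os

from bs4 import BeautifulSoup


def beautify_full_text(content):
    try:
        # Decode the \uXXXX escapes, then parse the result as HTML.
        soup = BeautifulSoup(content.encode('utf-8').decode('unicode-escape'), "html.parser")
        # Remove presentation attributes from every tag.
        for tag in soup():
            for attribute in ["class", "id", "name", "style"]:
                del tag[attribute]

        # Keep only the non-empty lines of the extracted text.
        return os.linesep.join([s for s in soup.text.splitlines() if s])
    except Exception as e:
        print(e)
        return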

The returned text no longer has HTML tags, but it now contains the following:

{"overview":"Fioriã\x82¢ã\x83\x97ã\x83ªã\x81®å\x8b\x95ä½\x9c確èª\x8dã\x81§ã\x80\x81ï¼\x92é\x80\x9aã\x82\x8aã\x81®ã\x83\x88ã\x83©ã\x83\x96ã\x83«ã\x82·ã\x83¥ã\x83¼ã\x83\x86ã\x82£ã\x83³ã\x82°ã\x82\x92ã\x81\x99ã\x82\x8bÂ\xa0\nGatewayã\x81®ã\x82¨ã\x83©ã\x83¼ã\x83\xadã\x82°ã\x82\x92確èª\x8dÂ\xa0\nã\x83\x96ã\x83©ã\x82¦ã\x82¶ã\x81®ã\x82³ã\x83³ã\x82½ã\x83¼ã\x83«ã\x81§ICFã\x82µã\x83¼ã\x83\x93ã\x82¹ç\xad\x89ã\x81§403/403ã\x81\x8cå\x87ºã\x81¦ã\x81\x84ã\x81ªã\x81\x84ã\x81\x8b\nÂ\xa0â\x80¯Â\xa0[Gateway
Foundation] Which Tools Can Be Used for
Troubleshooting?Â\xa0\n極å\x8a\x9bã\x83\xadã\x82°ã\x82ªã\x83³è¨\x80èª\x9eï¼\x9dè\x8b±èª\x9eã\x81«ã\x81\x97ã\x81¦ã\x80\x81ã\x80\x8cggrksã\x80\x8dã\x82\x92ã\x82ªã\x83\x96ã\x83©ã\x83¼ã\x83\x88ã\x81«å\x8c\nã\x82\x93ã\x81§è¨\x80ã\x81\x86Â\xa0\n"}

Is there a way I can eliminate these unwanted characters as well?

Upvotes: 0

Views: 125

Answers (2)

Vishnukk

Reputation: 564

It turns out that a small tweak solved the problem: encoding the extracted text to ASCII with errors ignored, which drops every non-ASCII character. The code now looks like this:

import os

from bs4 import BeautifulSoup


def beautify_full_text(content):
    try:
        soup = BeautifulSoup(content.encode('utf-8').decode('unicode-escape'), "html.parser")
        for tag in soup():
            for attribute in ["class", "id", "name", "style"]:
                del tag[attribute]

        beau_text = os.linesep.join([s for s in soup.text.splitlines() if s])
        # Drop every character that cannot be represented in ASCII.
        beau_text = beau_text.encode("ascii", "ignore").decode()
        return beau_text
    except Exception as e:
        print(e)
        return

Upvotes: 0

Mark Tolonen

Reputation: 178115

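The problem with the unicode-escape codec is that it decodes the escape codes but also decodes every other byte as latin1. Since the stream contains non-latin1 characters (the Japanese text, which is stored as multi-byte UTF-8), each of those bytes becomes a separate wrong character. Re-encode as latin1 to undo the incorrect decoding, then decode as utf8: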

s='''\
{"overview":"\\u003cp\\u003e\\u003cspan style=\\"font-size:
10.5pt;\\"\\u003e\\u003cspan class=\\"TextRun SCXW87260372 BCX0\\" style=\\"margin: 0px; padding: 0px; -webkit-user-drag: none;
-webkit-tap-highlight-color: transparent; color: #000000; font-family: \'Meiryo UI\', \'Meiryo UI_MSFontService\', sans-serif; font-kerning:
none; line-height: 15.1083px; font-variant-ligatures: none
!important;\\"\\u003e\\u003cspan class=\\"NormalTextRun SCXW87260372
BCX0\\" style=\\"margin: 0px; padding: 0px; -webkit-user-drag: none;
-webkit-tap-highlight-color: transparent; background-color: inherit;\\"\\u003eFioriアプリの動作確認で、2通りのトラブルシューティングをする\\u003c/span\\u003e\\u003c/span\\u003e\\u003cspan
class=\\"EOP SCXW87260372 BCX0\\" style=\\"margin: 0px;'''

print(s.encode('utf8').decode('unicode-escape').encode('latin1').decode('utf8'))

Output:

{"overview":"<p><span style="font-size:
10.5pt;"><span class="TextRun SCXW87260372 BCX0" style="margin: 0px; padding: 0px; -webkit-user-drag: none;
-webkit-tap-highlight-color: transparent; color: #000000; font-family: 'Meiryo UI', 'Meiryo UI_MSFontService', sans-serif; font-kerning:
none; line-height: 15.1083px; font-variant-ligatures: none
!important;"><span class="NormalTextRun SCXW87260372
BCX0" style="margin: 0px; padding: 0px; -webkit-user-drag: none;
-webkit-tap-highlight-color: transparent; background-color: inherit;">Fioriアプリの動作確認で、2通りのトラブルシューティングをする</span></span><span
class="EOP SCXW87260372 BCX0" style="margin: 0px;
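
To make the mechanism easier to see, here is a minimal sketch on a much shorter string. The `sample` value is a made-up stand-in for the scraped text, not the original data; it only assumes the same mix of `\uXXXX` escapes and raw Japanese characters.

```python
# Hypothetical short sample: escaped tags around real Japanese text.
sample = '\\u003cp\\u003eアプリ\\u003c/p\\u003e'

# unicode-escape resolves the \uXXXX escapes but reads each UTF-8 byte of
# the Japanese text as a separate latin1 character (mojibake).
garbled = sample.encode('utf8').decode('unicode-escape')

# Encoding back to latin1 restores the original UTF-8 bytes,
# which can then be decoded correctly.
fixed = garbled.encode('latin1').decode('utf8')

print(repr(garbled))
print(fixed)  # <p>アプリ</p>
```

One caveat: the round trip only works while every decoded escape is below U+0100. If the input contained an escape such as `\u3042`, the `encode('latin1')` step would raise `UnicodeEncodeError`.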

Now that it is decoded, it looks like it was a JSON response. If you used the requests module to retrieve the data, check whether response.json() decodes it correctly, or use json.loads() on your scraped string.
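
As a rough illustration of the json.loads() route, the sketch below parses a shortened, made-up version of the payload. The `raw` string and the variable names are assumptions for the example; only the `"overview"` key comes from the question.

```python
import json

# Hypothetical, shortened stand-in for the scraped string: JSON containing
# \u003c / \u003e escapes alongside raw Japanese text.
raw = '{"overview":"\\u003cp\\u003eFioriアプリ\\u003c/p\\u003e"}'

# The JSON parser resolves the escapes itself and leaves the
# Japanese characters untouched, so no codec round trip is needed.
data = json.loads(raw)
html = data["overview"]

print(html)  # <p>Fioriアプリ</p>
```

The resulting `html` string can then be passed to BeautifulSoup as in the question, without the unicode-escape step. If the real payload has literal line breaks inside string values, `json.loads(raw, strict=False)` may be needed, since the default strict mode rejects raw control characters in strings.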

Upvotes: 1
