Reputation: 564
My original web-scraped HTML text looks like this:
> {"overview":"\\u003cp\\u003e\\u003cspan style=\\"font-size:
> 10.5pt;\\"\\u003e\\u003cspan class=\\"TextRun SCXW87260372 BCX0\\" style=\\"margin: 0px; padding: 0px; -webkit-user-drag: none;
> -webkit-tap-highlight-color: transparent; color: #000000; font-family: \'Meiryo UI\', \'Meiryo UI_MSFontService\', sans-serif; font-kerning:
> none; line-height: 15.1083px; font-variant-ligatures: none
> !important;\\"\\u003e\\u003cspan class=\\"NormalTextRun SCXW87260372
> BCX0\\" style=\\"margin: 0px; padding: 0px; -webkit-user-drag: none;
> -webkit-tap-highlight-color: transparent; background-color: inherit;\\"\\u003eFioriアプリの動作確認で、2通りのトラブルシューティングをする\\u003c/span\\u003e\\u003c/span\\u003e\\u003cspan
> class=\\"EOP SCXW87260372 BCX0\\" style=\\"margin: 0px;.....
I used BeautifulSoup to strip all the HTML tags with the code below:
import os
from bs4 import BeautifulSoup

def beautify_full_text(content):
    try:
        soup = BeautifulSoup(content.encode('utf-8').decode('unicode-escape'), "html.parser")
        for tag in soup():
            for attribute in ["class", "id", "name", "style"]:
                del tag[attribute]
        return os.linesep.join([s for s in soup.text.splitlines() if s])
    except Exception as e:
        print(e)
        return
The returned text no longer contains HTML tags, but it now contains the garbled characters below:
{"overview":"Fioriã\x82¢ã\x83\x97ã\x83ªã\x81®å\x8b\x95ä½\x9c確èª\x8dã\x81§ã\x80\x81ï¼\x92é\x80\x9aã\x82\x8aã\x81®ã\x83\x88ã\x83©ã\x83\x96ã\x83«ã\x82·ã\x83¥ã\x83¼ã\x83\x86ã\x82£ã\x83³ã\x82°ã\x82\x92ã\x81\x99ã\x82\x8bÂ\xa0\nGatewayã\x81®ã\x82¨ã\x83©ã\x83¼ã\x83\xadã\x82°ã\x82\x92確èª\x8dÂ\xa0\nã\x83\x96ã\x83©ã\x82¦ã\x82¶ã\x81®ã\x82³ã\x83³ã\x82½ã\x83¼ã\x83«ã\x81§ICFã\x82µã\x83¼ã\x83\x93ã\x82¹ç\xad\x89ã\x81§403/403ã\x81\x8cå\x87ºã\x81¦ã\x81\x84ã\x81ªã\x81\x84ã\x81\x8b\nÂ\xa0â\x80¯Â\xa0[Gateway
Foundation] Which Tools Can Be Used for
Troubleshooting?Â\xa0\n極å\x8a\x9bã\x83\xadã\x82°ã\x82ªã\x83³è¨\x80èª\x9eï¼\x9dè\x8b±èª\x9eã\x81«ã\x81\x97ã\x81¦ã\x80\x81ã\x80\x8cggrksã\x80\x8dã\x82\x92ã\x82ªã\x83\x96ã\x83©ã\x83¼ã\x83\x88ã\x81«å\x8c\nã\x82\x93ã\x81§è¨\x80ã\x81\x86Â\xa0\n"}
Is there a way I can eliminate these unwanted characters as well?
Upvotes: 0
Views: 125
Reputation: 564
It turns out that a small tweak solved the problem. The code now looks like this:
def beautify_full_text(content):
    try:
        soup = BeautifulSoup(content.encode('utf-8').decode('unicode-escape'), "html.parser")
        for tag in soup():
            for attribute in ["class", "id", "name", "style"]:
                del tag[attribute]
        beau_text = os.linesep.join([s for s in soup.text.splitlines() if s])
        # Drop every non-ASCII character (note: this also removes the Japanese text itself)
        beau_text = beau_text.encode("ascii", "ignore").decode()
        return beau_text
    except Exception as e:
        print(e)
        return
Upvotes: 0
Reputation: 178115
The problem with the unicode-escape codec is that it decodes the escape codes, but also decodes every other byte as latin1. Since you have non-latin1 characters in the stream, re-encode as latin1 to undo the incorrect decoding, then decode as utf8 again:
s='''\
{"overview":"\\u003cp\\u003e\\u003cspan style=\\"font-size:
10.5pt;\\"\\u003e\\u003cspan class=\\"TextRun SCXW87260372 BCX0\\" style=\\"margin: 0px; padding: 0px; -webkit-user-drag: none;
-webkit-tap-highlight-color: transparent; color: #000000; font-family: \'Meiryo UI\', \'Meiryo UI_MSFontService\', sans-serif; font-kerning:
none; line-height: 15.1083px; font-variant-ligatures: none
!important;\\"\\u003e\\u003cspan class=\\"NormalTextRun SCXW87260372
BCX0\\" style=\\"margin: 0px; padding: 0px; -webkit-user-drag: none;
-webkit-tap-highlight-color: transparent; background-color: inherit;\\"\\u003eFioriアプリの動作確認で、2通りのトラブルシューティングをする\\u003c/span\\u003e\\u003c/span\\u003e\\u003cspan
class=\\"EOP SCXW87260372 BCX0\\" style=\\"margin: 0px;'''
print(s.encode('utf8').decode('unicode-escape').encode('latin1').decode('utf8'))
Output:
{"overview":"<p><span style="font-size:
10.5pt;"><span class="TextRun SCXW87260372 BCX0" style="margin: 0px; padding: 0px; -webkit-user-drag: none;
-webkit-tap-highlight-color: transparent; color: #000000; font-family: 'Meiryo UI', 'Meiryo UI_MSFontService', sans-serif; font-kerning:
none; line-height: 15.1083px; font-variant-ligatures: none
!important;"><span class="NormalTextRun SCXW87260372
BCX0" style="margin: 0px; padding: 0px; -webkit-user-drag: none;
-webkit-tap-highlight-color: transparent; background-color: inherit;">Fioriアプリの動作確認で、2通りのトラブルシューティングをする</span></span><span
class="EOP SCXW87260372 BCX0" style="margin: 0px;
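To see why the extra latin1 step works, it may help to follow a single character through the round trip. The sketch below is only an illustration: the variable names are made up, and the one-character sample stands in for the full scraped string.

```python
# Hypothetical one-character sample: "ア" (U+30A2) follows "Fiori" in the scraped text.
ch = "ア"

utf8_bytes = ch.encode("utf8")                       # three bytes: E3 82 A2
mojibake = utf8_bytes.decode("unicode-escape")       # each byte becomes one latin1 character
repaired = mojibake.encode("latin1").decode("utf8")  # back to the same bytes, decoded correctly

print(utf8_bytes)      # b'\xe3\x82\xa2'
print(repr(mojibake))  # 'ã\x82¢'
print(repaired)        # ア
```

The middle value is the same sequence that appears right after "Fiori" in the garbled output from the question. Note that the latin1 step should only be expected to work while every decoded character fits in latin1; if the input contained an escape such as \u30a2, the re-encode would raise UnicodeEncodeError.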
Now that it is decoded, it looks more like it was a JSON response. If you used the requests module to retrieve the data, look at response.json() to see if it decodes correctly, or use json.loads() on your scraped string.
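As a rough sketch of the json.loads() route, assuming the scraped string really is valid JSON: the sample below is a shortened, hypothetical stand-in for the real payload, and TextExtractor and html_to_text are illustrative helpers, not part of any library. The standard library's html.parser is used here instead of BeautifulSoup only to keep the example self-contained.

```python
import json
from html.parser import HTMLParser

# Hypothetical, shortened stand-in for the scraped string (raw string keeps the backslashes).
raw = r'{"overview":"\u003cp\u003e\u003cspan style=\"font-size: 10.5pt;\"\u003eFioriアプリ\u003c/span\u003e\u003c/p\u003e"}'


class TextExtractor(HTMLParser):
    """Collects only the text nodes and ignores every tag and attribute."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)


# json.loads resolves the \u003c and \" escapes and leaves the Japanese text untouched.
data = json.loads(raw)
text = html_to_text(data["overview"])

print(data["overview"])  # <p><span style="font-size: 10.5pt;">Fioriアプリ</span></p>
print(text)              # Fioriアプリ
```

With this approach there is no unicode-escape decoding at all, so no latin1 repair step is needed; the decoded HTML could equally be passed to BeautifulSoup as in the original function.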
Upvotes: 1