Reputation: 121
I'm very new to Python, and I'm trying to fetch the source code of web pages so I can work with their HTML elements.
But when I decode the bytes as UTF-8, some of the HTML code seems to disappear. Here is my code:
import urllib.request
req = urllib.request.Request('http://avast.softonic.com/')
response = urllib.request.urlopen(req)
the_page = response.read()
For example, the content of the DIV whose ID is "review_data" in "the_page" is:
\n\n\t\t\t\t\t\t\t\t\t\t<div id="review_data" class="track_links">\n\t\t\t\t\t\t\t\t\t\t\t\t<p><!--[lead]-->Los expertos en soluciones antivirus gratuitas conocen bien el Avast Free Antivirus 2016, y probablemente ya lo hayan instalado alguna vez. Este software es <strong>uno de los l\xc3\xadderes en su campo</strong>, proporcionando un s\xc3\xb3lido conjunto de defensas contra virus y malware, as\xc3\xad como algunas otras herramientas \xc3\xbatiles que ni se imagina. Mejor a\xc3\xban, <strong>Avast es uno de los antivirus menos intrusivos</strong>, quiz\xc3\xa1 no tanto en los \xc3\xbaltimos a\xc3\xb1os, pero sigue siendo un sistema mucho menos acaparador que los dos grandes antivirus.\r<br /><!--[/lead]--></p>\r<p><!--[features]--><!--[subfeatures]--><h3>Lleno de caracter\xc3\xadsticas.</h3><!--[/subfeatures]--></p>\r<p>Una gran ventaja del Avast Free Antivirus 2016 es su conjunto de caracter\xc3\xadsticas. Aunque estas caracter\xc3\xadsticas han provocado que el tama\xc3\xb1o de instalaci\xc3\xb3n sea mayor (se recomienda hasta 2 GB de espacio de disco duro disponible), no deber\xc3\xada resultar un problema para la mayor\xc3\xada de los discos duros modernos, adem\xc3\xa1s incluye gran cantidad de herramientas de forma gratuita. Aparte de la exploraci\xc3\xb3n antivirus est\xc3\xa1ndar, que se mantiene firme con<strong> actualizaciones peri\xc3\xb3dicas</strong>, la \xc3\xbaltima versi\xc3\xb3n de Avast tiene la seguridad de red dom\xc3\xa9stica que detecta vulnerabilidades para todos los dispositivos conectados a la red. <strong>La \xc3\xbaltima versi\xc3\xb3n, la actualizaci\xc3\xb3n \'Nitro\', tambi\xc3\xa9n a\xc3\xb1ade un navegador dedicado llamado Avast SafeZone</strong>. Aclamado como el navegador m\xc3\xa1s seguro del mundo, es a la vez un software inflado con car\xc3\xa1cter gratuito. Para aquellos a los que les importa la seguridad, especialmente en lo que se refiere a cuestiones bancarias, el programa resulta ser una bendici\xc3\xb3n. 
El <strong>bloqueador de anuncios incorporado</strong> puede ser un regalo del cielo a la hora de visitar ciertos sitios. Otra nueva caracter\xc3\xadstica es Cybercapture, lo que pone en cuarentena los archivos entrantes sospechosos. Las v\xc3\xadctimas de los virus sabr\xc3\xa1n la importancia de este buffer.\r<br /><!--[/features]--></p>\r<p><!--[usability]--><!--[subusability]--><h3>Una interfaz sencilla y eficaz</h3><!--[/subusability]--></p>\r<p>Avast ha cambiado varias veces a lo largo de los a\xc3\xb1os y la actualizaci\xc3\xb3n Nitro no es una excepci\xc3\xb3n, pero por suerte su dise\xc3\xb1o parece haber permanecido constante. El programa es <strong>simple y f\xc3\xa1cil de usar, con botones definidos y textos claros</strong> en colores agradables. Avast Free Antivirus 2016 se asentar\xc3\xa1 en la bandeja del sistema hasta que se necesite, al igual que la mayor\xc3\xada del software antivirus, se expande cuando se abre en una ventana peque\xc3\xb1a sin fronteras con apariencia elegante y coincide con el esquema de dise\xc3\xb1o de Windows 10. La mayor\xc3\xada de las secciones de este programa son bastante f\xc3\xa1ciles de seguir, con un gran conjunto de botones para las herramientas e iconos est\xc3\xa1ndar, como una rueda dentada para acceder a la configuraci\xc3\xb3n. Por supuesto, siempre puedes actualizar pulsando el bot\xc3\xb3n premium, anim\xc3\xa1ndole a descargar y pagar por Avast Premier. Sin embargo, esto no es obligatorio. Cada una de las principales caracter\xc3\xadsticas de Avast tiene su propia secci\xc3\xb3n, tales como la seguridad de Internet, el navegador SafeZone y la exploraci\xc3\xb3n inteligente, as\xc3\xad que realmente nada puede ir mal.\r<br /><!--[/usability]--></p>\r<p><!--[conclusion]--><!--[subconclusion]--><h3>Las mejores cosas de la vida son gratis</h3><!--[/subconclusion]--></p>\r<p>Para un programa gratuito, <strong>Avast es realmente excelente</strong>. 
S\xc3\xad, se ha perdido algo de su sensaci\xc3\xb3n m\xc3\xa1s independiente de ediciones pasadas, pero eso es solo un peque\xc3\xb1o precio para un software libre de estas caracter\xc3\xadsticas. Avast Free Antivirus 2016 es menos intrusivo en su navegaci\xc3\xb3n diaria y es muy sencillo de utilizar, por lo que sigue siendo una de las principales soluciones gratuitas.\r<br /><!--[/conclusion]--></p>\n\t\t\t\t\t\t\t\t\t\t\t</div>
But when I try to do any of the following things:
import urllib.request
req = urllib.request.Request('http://avast.softonic.com/')
response = urllib.request.urlopen(req)
the_page = response.read()
html_missing_elements = the_page.decode('utf-8')
Or:
import requests
r = requests.get('http://avast.softonic.com/')
html_missing_elements = r.text
Or:
import urllib.request
from bs4 import BeautifulSoup
req = urllib.request.Request('http://avast.softonic.com/')
response = urllib.request.urlopen(req)
the_page = response.read()
html_missing_elements = BeautifulSoup(the_page)
Following the example, the DIV with the ID "review_data" contains only:
<div id="review_data" class="track_links"><br /><!--[/conclusion]--></p></div>
I can't get the full original HTML code of the page; some of it is missing, and I want to know why.
Thanks.
Upvotes: 2
Views: 473
Reputation: 180481
There are some carriage returns (i.e. \r) embedded in the HTML:
\r<br /><!--[/lead]--></p>\r
>\r<p>A big plus point for Avast Free Antivirus 2016
and many more.
Once you remove them, everything will work fine in your IDE and you can see the tag content when you print it:
soup = BeautifulSoup(r.content.replace(b"\r",b""))
print(soup.select_one("#review_data"))
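The same clean-up can be tried without any network access. The following is only a minimal sketch: the `sample` bytes are a made-up stand-in for the downloaded page, and `strip_carriage_returns` is a hypothetical helper name, not part of requests or BeautifulSoup.

```python
# Hypothetical sample standing in for the bytes returned by the server.
sample = b'<div id="review_data">\r<p>l\xc3\xadderes</p>\r<br /></div>'


def strip_carriage_returns(raw: bytes) -> str:
    """Decode UTF-8 bytes after normalising Windows line endings
    and dropping any remaining bare carriage returns."""
    return raw.replace(b"\r\n", b"\n").replace(b"\r", b"").decode("utf-8")


cleaned = strip_carriage_returns(sample)

# repr() shows control characters as escapes, so nothing is hidden
# by the console when inspecting the before/after values.
print(repr(sample.decode("utf-8")))
print(repr(cleaned))
```

Decoding never drops the text here; the only difference between the two printed values is the `\r` characters.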
The data is actually there; your IDE is just not showing it because of the carriage returns:
soup = BeautifulSoup(r.content,"lxml")
print(soup.select_one("#review_data"))
Using PyCharm, this will output:
<div class="track_links" id="review_data">
<br/><!--[/conclusion]--></p>
</div>
But using:
print(soup.select_one("#review_data").text)
Will output:
\nConnoisseurs of free antivirus solutions will already know of Avast Free Antivirus 2016 and have probably installed it at some point or another. This software is one of the leaders in its field, providing a robust suite of defences against viruses and malware, as well as some other useful tools that you might not expect. Better still, Avast is one of the less intrusive antivirus programs- perhaps less so in recent years, but still a lot less system-hogging than the big two.\r Brimming with features A big plus point for Avast Free Antivirus 2016 is its suite of features. Although these features have caused its install size to increase (up to 2GB hard drive space is recommended!), it shouldn’t prove an issue for most modern hard drives and you do get a lot of tools for free. Aside from the standard antivirus scanning, which is kept sharp with constant updates, the latest version of Avast has home network security which detects vulnerabilities for all devices connected to your network. The latest version, the ‘Nitro’ update, also adds a dedicated Avast browser called SafeZone. Heralded as the world’s safest browser, this could equally be argued as bloatware and a great free feature. For those who are security conscious, especially regarding banking, it should be seen as beneficial. The in-built ad blocker can be a godsend when visiting certain sites. Another new feature is CyberCapture, which quarantines any suspicious incoming files. Victims of viruses will know the importance of this buffer.\r A simple and effective interface Avast has changed a few times over the years and the Nitro update is no different, but thankfully their design approach seems to have remained constant. The program is simple and straightforward to use, with bold buttons and clear text in friendly colours. 
Avast Free Antivirus 2016 will sit in the system tray until needed, like most antivirus software, then expand when opened into a small borderless window that looks sleek matching the Windows 10 design scheme. Most sections of this are easy enough to follow, with a large set of buttons for the tools and standard icons like a cog for accessing settings. Of course, you’re also never far away from a premium upgrade button, encouraging you to download and pay for Avast Premier. However, this is not forced upon you. Each of the main features of Avast has its own section, such as internet security, the SafeZone browser and Smart Scan, so you really can’t go wrong.\r The best things in life are free For a free program, Avast is pretty impressive. Yes, it has lost some of its independent feel as the years have gone by, but that’s a small price for a great bit of free software. Avast Free Antivirus 2016 will interfere with your everyday browsing less than the bigger names in software. It’s very simple to use, therefore remains one of the top free solutions.\r\n'
If you were to run the same code in IPython, you would see the correct output just using soup = BeautifulSoup(r.content,"lxml"):
In [5]: soup = BeautifulSoup(r.content,"lxml")
In [6]: soup.select_one("#review_data")
Out[6]:
<div class="track_links" id="review_data">
<p><!--[lead]-->Connoisseurs of free antivirus solutions will already know of Avast Free Antivirus 2016 and have probably installed it at some point or another. This software is one of the leaders in its field, providing a <strong>robust suite of defences against viruses and malware</strong>, as well as some other useful tools that you might not expect. Better still, Avast is one of the less intrusive antivirus `
<br/><!--[/lead]--></p> <p><!--[features]--><!--[subfeatures]--></p><h3>Brimming with features</h3><!--[/subfeatures]--> <p>A big plus point for Avast Free Antivirus 2016 is its suite of features. Although these features have caused its install size to increase (up to 2GB hard drive space is recommended!), it shouldn’t prove an issue for most modern hard drives and you do get a lot of tools for free.</p> <p>Aside from the standard antivirus scanning, which is kept sharp with constant updates, the latest version of Avast has <strong>home network security</strong> which detects vulnerabilities for all devices connected to your network.</p> <p>The latest version, the ‘Nitro’ update, also adds a dedicated Avast browser called <strong>SafeZone</strong>. Heralded as the world’s safest browser, this could equally be argued as bloatware and a great free feature. For those who are security conscious, especially regarding banking, it should be seen as beneficial. The in-built ad blocker can be a godsend when visiting certain sites. Another new feature is <strong>CyberCapture</strong>, which quarantines any suspicious incoming files. Victims of viruses will know the importance of this buffer.
<br/><!--[/features]--></p> <p><!--[usability]--><!--[subusability]--></p><h3>A simple and effective interface</h3><!--[/subusability]--> <p>Avast has changed a few times over the years and the <strong>Nitro update</strong> is no different, but thankfully their design approach seems to have remained constant. The program is <strong>simple and straightforward</strong> to use, with bold buttons and clear text in friendly colours.</p> <p>Avast Free Antivirus 2016 will sit in the system tray until needed, like most antivirus software, then expand when opened into a small borderless window that looks sleek matching the Windows 10 design scheme. Most sections of this are easy enough to follow, with a large set of buttons for the tools and standard icons like a cog for accessing settings.</p> <p>Of course, you’re also never far away from a premium upgrade button, encouraging you to download and pay for <a href="http://avast-premier-antivirus.en.softonic.com" title="Avast Premier">Avast Premier</a>. However, this is not forced upon you.</p> <p>Each of the main features of Avast has its own section, such as <strong>internet security</strong>, the SafeZone browser and <strong>Smart Scan</strong>, so you really can’t go wrong.
<br/><!--[/usability]--></p> <p><!--[conclusion]--><!--[subconclusion]--></p><h3>The best things in life are free</h3><!--[/subconclusion]--> <p>For a free program, Avast is pretty impressive. Yes, it has lost some of its independent feel as the years have gone by, but that’s a small price for a great bit of free software. Avast Free Antivirus 2016 will interfere with your everyday browsing less than the bigger names in software. It’s very simple to use, therefore remains <strong>one of the top free solutions</strong>.
<br/><!--[/conclusion]--></p>
</div>
It has nothing to do with encoding; it is simply the carriage returns interfering with the output in whatever console you run the code from. The simple example below uses a backspace (\b), another control character, to show how the output can be affected:
In [14]: s = "foo\bar"
In [15]: print(s)
foar
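To see why a console can hide text that is really in the string, here is a rough sketch of how a terminal might handle `\r` and `\b`. `render_like_terminal` is a hypothetical helper written for illustration, and it is a simplified model: real consoles differ (some clear the whole line on `\r` instead of overwriting it).

```python
def render_like_terminal(text: str) -> str:
    """Simplified model of a console: '\\r' moves the cursor to the
    start of the line, '\\b' moves it one column left, and any other
    character overwrites whatever is at the cursor position."""
    rendered = []
    for line in text.split("\n"):
        buf = []   # characters currently visible on this line
        col = 0    # cursor column
        for ch in line:
            if ch == "\r":
                col = 0
            elif ch == "\b":
                col = max(col - 1, 0)
            else:
                if col < len(buf):
                    buf[col] = ch      # overwrite existing character
                else:
                    buf.append(ch)     # extend the line
                col += 1
        rendered.append("".join(buf))
    return "\n".join(rendered)


s = "foo\bar"
print(render_like_terminal(s))  # what the console shows
print(repr(s))                  # what the string actually contains
```

Printing `repr(s)` instead of `s` is a quick way to check whether control characters are present before assuming that data has gone missing.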
Upvotes: 1