MADFROST

Reputation: 1723

How to read an HTML file without any limit using Python?

I have an HTML file that consists of 4,574 words and 57,718 characters.

When I read it with .read() and export it, the exported file seems to be cut off: it only contains 3,004 words and 39,248 characters.

How can I read the file and export it in full, without any limit?

This is my python script:

from IPython.display import FileLink, HTML

title = "Download HTML file"
filename = "data.html"

payload = open("./dendo_plot(2).html").read()
payload = payload.replace('"', "&quot;")
html = '<a download="{filename}" href="data:text/html;charset=utf-8,'+payload+'" target="_blank">{title}</a>'

print(payload)
HTML(html)

This is what I mean: left is the source file, right is the exported file. You can see there is a gap between the two files.


Upvotes: 1

Views: 329

Answers (1)

bastantoine

Reputation: 592

I don't think there's a problem here. I think you are simply misinterpreting a variation in a metric between your input and output.

When you call read() with no arguments on an open file, it reads the whole content of the file (until EOF) and puts it in memory:

To read a file’s contents, call f.read(size), which reads some quantity of data and returns it as a string [...]. size is an optional numeric argument. When size is omitted or negative, the entire contents of the file will be read and returned; it’s your problem if the file is twice as large as your machine’s memory.

From the official Python tutorial

So technically Python might be unable to read the whole file because it is too big to fit in memory, but I strongly doubt that's what is happening here.
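
One way to check this yourself is to compare the length of what read() returns with the size of the file on disk. The sketch below is only an illustration: it writes a hypothetical sample file (sample.html, with made-up ASCII content) to a temporary directory instead of using your real file, and passes newline="" so that Python does not translate line endings.

```python
import os
import tempfile

# Hypothetical sample content standing in for the real HTML file (ASCII only,
# so one character is exactly one byte on disk).
sample = "<p>" + "word " * 5000 + "</p>"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.html")

    # newline="" disables newline translation when writing and reading.
    with open(path, "w", encoding="utf-8", newline="") as f:
        f.write(sample)

    with open(path, encoding="utf-8", newline="") as f:
        payload = f.read()  # no size argument: read until EOF

    size_on_disk = os.path.getsize(path)

read_everything = payload == sample
print(len(payload), size_on_disk, read_everything)
```

If the two numbers match for your own file (allowing for multi-byte characters and line-ending translation), read() did not truncate anything.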


I believe the difference in the number of characters and words you see between your input and output is because your data is changed when it is processed.

Look at: payload = payload.replace('"', "&quot;"). From an HTML point of view, " and &quot; are equivalent and are displayed the same way (which is why you can swap them), but from a Python point of view they are different strings with different lengths:

>>> len('"')
1
>>> len("&quot;")
6

So this line alone already changes the character count between your input and output: every replaced quote adds five characters.
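
To make the effect concrete, here is a small sketch with a made-up one-line snippet (not your actual file) that measures how much the replace() call changes the length:

```python
# Made-up snippet used only for illustration.
original = '<a href="page.html" class="link">Hello</a>'

escaped = original.replace('"', "&quot;")

quote_count = original.count('"')
growth = len(escaped) - len(original)

# Each '"' (1 character) becomes '&quot;' (6 characters), so +5 per quote.
print(quote_count, growth)
```

The same comparison on your real payload, before and after the replace() call, would show how much of the difference comes from this step.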

That being said, I don't think the number of characters and words is a meaningful way to check whether two pieces of HTML are the same. Take the following example:

>>> first_html = """<div>
...     <p>Hello there</p>
... </div>"""
>>> len(first_html)
35
>>> second_html = "<div><p>Hello there</p></div>"
>>> len(second_html)
29

You would agree that both HTML snippets display the same thing, but they don't have the same number of characters. HTML is quite tolerant about spaces, tabs and newlines between tags, which is why a browser renders both examples the same way.
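
If you want a comparison that ignores this kind of layout difference, one rough option is to strip the whitespace between tags before comparing. The helper below, strip_inter_tag_whitespace, is a hypothetical name and a deliberately naive sketch: it assumes whitespace between tags is never significant, which is not true inside elements such as pre or between inline elements.

```python
import re


def strip_inter_tag_whitespace(html):
    # Naive normalisation: remove whitespace that sits between a closing '>'
    # and the next opening '<'. Text inside tags is left untouched.
    return re.sub(r">\s+<", "><", html.strip())


first_html = """<div>
    <p>Hello there</p>
</div>"""
second_html = "<div><p>Hello there</p></div>"

same_raw = first_html == second_html
same_normalised = (
    strip_inter_tag_whitespace(first_html)
    == strip_inter_tag_whitespace(second_html)
)
print(same_raw, same_normalised)
```

The raw strings differ, but the normalised ones compare equal.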

About the number of words, one simple question (well, not that simple to answer though ^^'): what qualifies as a word in HTML? Is it only the displayed text? Do the HTML tags count as well? If so, what about their attributes?
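
To illustrate how much the answer depends on the definition, the sketch below counts "words" in two ways for a made-up snippet: by splitting the raw markup on whitespace, and by splitting only the text nodes collected with the standard library's html.parser. The TextExtractor class is a hypothetical helper, and it makes no attempt to skip script or style content.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text found between tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


# Made-up snippet used only for illustration.
snippet = '<div class="note box"><p>Hello there</p></div>'

parser = TextExtractor()
parser.feed(snippet)
parser.close()

visible_text = " ".join(parser.chunks)

raw_words = len(snippet.split())            # tags and attributes included
visible_words = len(visible_text.split())   # text nodes only
print(raw_words, visible_words)
```

The two counts differ for the same document, so a word count reported by one tool is not directly comparable with one reported by another.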


So to sum up, I don't think you have a real problem here, only a difference that looks like a problem from one point of view but not from another.

Upvotes: 2
