hedlesschkn

Reputation: 31

Windows adding a bunch of whitespace/newlines to an HTML file written in Python using requests

Using the following code, I end up with one or more blank lines between every line of my file when running it on Windows (in a Jupyter notebook on Python 3), but not when running on Mac or Linux.

I assume it's some kind of encoding issue, something to do with Windows' "\r\n" shenanigans? Doing str(page.content) instead leaves me with a file full of "\r\n", as expected, but I'm not sure why it's chock-full of newlines to begin with.

Note: I have commented out a quick way to remove the whitespace, but it's a bit of a hack and not really what I'm after. I'm looking for why the whitespace is being added to begin with.

import requests

url = 'https://stackoverflow.com/questions/3030487/is-there-a-way-to-get-the-xpath-in-google-chrome'
page=requests.get(url)

newhtml = page.text

# import re
# newhtml = re.sub(r'\s\s+', ' ', page.text)

f = open('webpage.html', 'w', encoding='utf-8')
f.write(newhtml)
f.close()

Result Sample:

<html itemscope itemtype="http://schema.org/QAPage" class="html__responsive">



<head>



    <title>Is there a way to get the xpath in google chrome? - Stack Overflow</title>

    <link rel="shortcut icon" href="https://cdn.sstatic.net/Sites/stackoverflow/img/favicon.ico?v=4f32ecc8f43d">

    <link rel="apple-touch-icon image_src" href="https://cdn.sstatic.net/Sites/stackoverflow/img/apple-touch-icon.png?v=c78bd457575a">

    <link rel="search" type="application/opensearchdescription+xml" title="Stack Overflow" href="/opensearch.xml">

    <meta name="viewport" content="width=device-width, height=device-height, initial-scale=1.0, minimum-scale=1.0">





    <meta property="og:type" content= "website" />

    <meta property="og:url" content="https://stackoverflow.com/questions/3030487/is-there-a-way-to-get-the-xpath-in-google-chrome"/>

    <meta property="og:site_name" content="Stack Overflow" />

Upvotes: 2

Views: 358

Answers (1)

hedlesschkn

Reputation: 31

Looks like C14L nailed it. (How do I give you internet points for a comment? I can only do that for an answer, right?)

I switched over to f = open('webpage.html', 'wb', encoding='utf-8') and it complained:

ValueError: binary mode doesn't take an encoding argument

So I changed that to f = open('webpage.html', 'wb'), which complained:

TypeError: a bytes-like object is required, not 'str'

So I switched newhtml = page.text to newhtml = page.content and voilà, the output is as expected. Now to test that it doesn't break anything when running on Mac/Linux.
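A likely explanation for why binary mode helps, assuming the server sends the page with "\r\n" line endings (which matches the str(page.content) output mentioned in the question): in text mode on Windows, Python translates every "\n" to "\r\n" on write, so an existing "\r\n" ends up as "\r\r\n", which many editors display as an extra blank line. The sketch below reproduces this on any OS by passing newline='\r\n' to emulate the Windows default. The sample string and the file name demo.html are made up for illustration.

```python
import os
import tempfile

# Made-up sample standing in for page.text; assumes CRLF line endings.
sample = "<html>\r\n<head>\r\n"

path = os.path.join(tempfile.gettempdir(), "demo.html")

# newline='\r\n' emulates the Windows text-mode default on any platform:
# every '\n' written is translated to '\r\n'.
with open(path, "w", encoding="utf-8", newline="\r\n") as f:
    f.write(sample)

# Read the raw bytes back to see what actually landed on disk.
with open(path, "rb") as f:
    raw = f.read()

print(raw)  # b'<html>\r\r\n<head>\r\r\n'
```

On Mac or Linux the default translation leaves "\n" alone, which would explain why the extra lines only show up on Windows.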

Final functional code:

import requests

url = 'https://stackoverflow.com/questions/3030487/is-there-a-way-to-get-the-xpath-in-google-chrome'
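page = requests.get(url)

# page.content is the raw response body as bytes, so no newline translation happens
newhtml = page.content

# Binary mode writes the bytes exactly as received
with open('webpage.html', 'wb') as f:
    f.write(newhtml)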
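If the decoded text is needed instead of bytes (for example, to modify the HTML before saving), one alternative sketch is to stay in text mode and pass newline='' to open, which tells Python to write line endings untouched. save_html is a hypothetical helper and the sample string is made up. With requests, page.text would be passed in instead.

```python
import os
import tempfile


def save_html(path, text):
    """Write text as UTF-8 without translating line endings (hypothetical helper)."""
    # newline='' disables newline translation on write, so '\r\n' stays '\r\n'
    # on Windows, macOS, and Linux alike.
    with open(path, "w", encoding="utf-8", newline="") as f:
        f.write(text)


# Made-up sample standing in for page.text; assumes CRLF line endings.
sample = "<html>\r\n<head>\r\n"
path = os.path.join(tempfile.gettempdir(), "webpage_demo.html")
save_html(path, sample)

# Check the bytes on disk: the line endings should be unchanged.
with open(path, "rb") as f:
    written = f.read()
```

Note that this variant re-encodes the page as UTF-8, whereas the 'wb' version keeps whatever bytes the server sent.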

Upvotes: 1
