Reputation: 31
Using the following code, I end up with one or more blank lines between every line in my file when running on Windows (in a Jupyter notebook on Python 3), but NOT when running on Mac or Linux.
I assume it's some kind of encoding issue, something to do with Windows' "\r\n" shenanigans? Writing str(page.content) instead leaves me with a file full of literal \r\n, as expected, but I'm not sure why the file is chock-full of newlines to begin with.
Note: I have commented out a quick way to remove whitespace, but it's a bit of a hack and not really what I'm after. I'm looking for why the whitespace is being added to begin with.
import requests
url = 'https://stackoverflow.com/questions/3030487/is-there-a-way-to-get-the-xpath-in-google-chrome'
page=requests.get(url)
newhtml = page.text
# import re
# newhtml = re.sub(r'\s\s+', ' ', page.text)
f = open('webpage.html', 'w', encoding='utf-8')
f.write(newhtml)
f.close()
Result Sample:
<html itemscope itemtype="http://schema.org/QAPage" class="html__responsive">
<head>
<title>Is there a way to get the xpath in google chrome? - Stack Overflow</title>
<link rel="shortcut icon" href="https://cdn.sstatic.net/Sites/stackoverflow/img/favicon.ico?v=4f32ecc8f43d">
<link rel="apple-touch-icon image_src" href="https://cdn.sstatic.net/Sites/stackoverflow/img/apple-touch-icon.png?v=c78bd457575a">
<link rel="search" type="application/opensearchdescription+xml" title="Stack Overflow" href="/opensearch.xml">
<meta name="viewport" content="width=device-width, height=device-height, initial-scale=1.0, minimum-scale=1.0">
<meta property="og:type" content= "website" />
<meta property="og:url" content="https://stackoverflow.com/questions/3030487/is-there-a-way-to-get-the-xpath-in-google-chrome"/>
<meta property="og:site_name" content="Stack Overflow" />
Upvotes: 2
Views: 358
Reputation: 31
Looks like C14L nailed it. (How do I give you internet points as a comment? I can only do that as an answer, right?)
I switched over to f = open('webpage.html', 'wb', encoding='utf-8')
and it complained:
ValueError: binary mode doesn't take an encoding argument
So I changed that to f = open('webpage.html', 'wb')
and then f.write(newhtml) complained:
TypeError: a bytes-like object is required, not 'str'
So I changed newhtml = page.text
to newhtml = page.content
and voila, the output is as expected. Now to test that it doesn't break anything when running on Mac/Linux.
Final functional code:
import requests
url = 'https://stackoverflow.com/questions/3030487/is-there-a-way-to-get-the-xpath-in-google-chrome'
page=requests.get(url)
newhtml = page.content
f = open('webpage.html', 'wb')
f.write(newhtml)
f.close()
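For anyone wondering why binary mode fixes it, here is a minimal sketch. It assumes the page text already contains \r\n line endings, which the str(page.content) output in the question suggests. In text mode on Windows, Python translates every '\n' it writes into '\r\n', so an existing '\r\n' ends up as '\r\r\n' on disk, and that can show up as extra blank lines. The written_bytes helper and the sample string are made up for illustration; passing newline='\r\n' is just a way to mimic the Windows default on any platform.

```python
import os
import tempfile


def written_bytes(text, newline):
    """Write text in text mode with the given newline setting and
    return the raw bytes that ended up on disk (hypothetical helper)."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        with open(path, 'w', encoding='utf-8', newline=newline) as f:
            f.write(text)
        with open(path, 'rb') as f:
            return f.read()
    finally:
        os.remove(path)


# Made-up stand-in for page.text, with CRLF line endings
sample = '<head>\r\n<title>x</title>\r\n'

# newline='\r\n' mimics what text mode does by default on Windows
windows_like = written_bytes(sample, '\r\n')

# newline='' switches the translation off while staying in text mode
untranslated = written_bytes(sample, '')

print(windows_like)
print(untranslated)
```

If that assumption holds, opening the file with open('webpage.html', 'w', encoding='utf-8', newline='') should be another way to avoid the doubled line endings while still writing page.text, though I have only reasoned about that variant here, not run it against the real page.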
Upvotes: 1