Reputation: 49
I've been stuck on this for way too long. I'm trying to decode the bytes objects received from the request. When I decode to UTF-8 and print, I don't see the string representation of the bytes object. What am I missing here?
import urllib.request

url = 'https://www2.census.gov/geo/docs/reference/codes/files/national_cousub.txt'
data = urllib.request.urlopen(url)
counter = 0
for line in data:
    print('byte string:')
    print(line)
    print('after decoding:')
    print(line.decode('utf-8'))
    counter = counter + 1
    if counter > 5:
        break
This is what I see on console:
byte string:
b'STATE,STATEFP,COUNTYFP,COUNTYNAME,COUSUBFP,COUSUBNAME,FUNCSTAT\r\r\n'
after decoding:
byte string:
b'AL,01,001,Autauga County,90171,Autaugaville CCD,S\r\r\n'
after decoding:
byte string:
b'AL,01,001,Autauga County,90315,Billingsley CCD,S\r\r\n'
after decoding:
byte string:
b'AL,01,001,Autauga County,92106,Marbury CCD,S\r\r\n'
after decoding:
byte string:
b'AL,01,001,Autauga County,92628,Prattville CCD,S\r\r\n'
after decoding:
byte string:
b'AL,01,003,Baldwin County,90207,Bay Minette CCD,S\r\r\n'
after decoding:
I am on Windows 10 with Python 3.5.5, installed via Anaconda. I am running this in PyCharm.
sys.stdout.encoding = 'UTF-8'
Same results with print(line.decode('utf-8'), file=sys.stderr)
Upvotes: 1
Views: 2546
Reputation: 365915
Your strings all end with \r\r\n. This is wrong, but (a) it's not your fault but the census website's fault, and (b) it shouldn't be causing this problem.
Assuming you're on Windows, the \r\n at the end is a normal newline. But the extra \r before it, without a \n, is a carriage return that moves the cursor back to the start of the current line. Printing the \r\n newline then overwrites the rest of the line.
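One way to confirm that the decode itself worked, and that only the display is at fault, is to print the repr of the decoded string, which shows control characters as escapes instead of sending them to the console. A minimal sketch, using a sample line copied from the output in the question rather than a live request:

```python
# Sample bytes copied from the console output in the question.
raw = b'AL,01,001,Autauga County,90171,Autaugaville CCD,S\r\r\n'

decoded = raw.decode('utf-8')

# repr() escapes \r and \n, so the console never gets to interpret them.
print(repr(decoded))

# The text is all there; only the line ending is unusual.
print(decoded.endswith('\r\r\n'))
```

If the repr shows the full text, the decode succeeded and the blank lines come from how the console renders the stray carriage return.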
That last part is what shouldn't happen. Printing a newline should just move to the next line. You can see that by running this program at the Windows command line, in a macOS or Linux terminal, or on repl.it.
But you're running in PyCharm, with your output going to the PyCharm debugging console. The PyCharm debugging console doesn't work like a complete terminal emulator, and one of the differences is, apparently, that it handles \r strangely. This question has more information about that. (And the same thing happens in other JetBrains IDEs, like printing the same text with Java in IntelliJ, just as you'd expect.)
There doesn't seem to be a fix for the debugging console; that's just how it works.
You can send output to PyCharm's terminal output instead of its debugging window, or run the program in its terminal, or use your Windows command prompt instead of PyCharm, or use a different IDE… but all of those mean you can't use the PyCharm debugging console for debugging, which may not be a tradeoff worth making.
If you want to work around the problem without changing your setup, the simplest solution is to remove those extra \r characters:
print(line.decode('utf-8').replace('\r\r\n', '\r\n'))
Or, better, as suggested by aldo in the comments, call either strip or rstrip to remove all those newline-ish characters. If you want the line to end with a proper newline (so you still get a blank line after each line):
print(line.decode('utf-8').rstrip()+'\n')
And if you don’t, it’s even simpler:
print(line.decode('utf-8').rstrip())
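Putting this together, here is a sketch of the loop from the question with the line endings stripped. To keep it runnable without network access, an io.BytesIO object stands in for the response from urllib.request.urlopen; the helper name first_lines and the sample data are illustrative assumptions, not part of the original code.

```python
import io
from itertools import islice


def first_lines(stream, limit=6, encoding='utf-8'):
    """Decode up to `limit` lines from a binary stream, dropping trailing CR/LF."""
    return [raw.decode(encoding).rstrip('\r\n') for raw in islice(stream, limit)]


# Stand-in for urllib.request.urlopen(url): sample bytes with the same \r\r\n endings.
sample = io.BytesIO(
    b'STATE,STATEFP,COUNTYFP,COUNTYNAME,COUSUBFP,COUSUBNAME,FUNCSTAT\r\r\n'
    b'AL,01,001,Autauga County,90171,Autaugaville CCD,S\r\r\n'
)

for text in first_lines(sample):
    print(text)
```

This version passes '\r\n' to rstrip so that only carriage returns and newlines are removed; a bare rstrip() would also drop trailing spaces and tabs, which may or may not matter for your data.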
Upvotes: 3