hoffKomar810

Reputation: 1

Reading python string file changes format when added to list. Why?

I am loading phone numbers stored in a txt file. Printing the loaded lines looks fine. I then append each line to a list. When I display the list itself, the items contain extra \x00 characters instead of plain text. When I write the contents of the list to a new file, I get unwanted characters and no line breaks, despite stripping and specifying UTF-8.

original_file = open("original.txt", "r", encoding="UTF-8", errors="replace")
pl = []
for item in original_file:
    pl.append(item)
target_file = open("target.txt", "w", encoding="UTF-8")
for item in pl:
    target_file.write(item) # or .write("{}\n".format(item))
                            # neither gets me the desired new line

original file contents:

(248) 370-0000
(706) 862-2128
(863) 763-8632
(682) 404-0051
(734) 667-2877
...

When I load the lines into the pl list and print each item:

for item in pl: print(item)

I get this:

(248) 370-0000
(706) 862-2128
(863) 763-8632
(682) 404-0051
(734) 667-2877

but when I simply write the list name pl I get this:

['\x00(\x006\x001\x000\x00)\x00 \x003\x009\x002\x00-\x003\x001\x001\x005\x00\n', '\x00(\x002\x001\x004\x00)\x00 \x009\x004\x001\x00-\x003\x008\x004\x001\x00\n', '\x00(\x003\x000\x004\x00)\x00 \x002\x001\x006\x00-\x002\x000\x009\x006\x00\n', '\x00(\x007\x002\x004\x00)\x00 \x003\x003\x007\x00-\x003\x005\x000\x004\x00\n', '\x00(\x002\x004\x008\x00)\x00 \x003\x007\x000\x00-\x000\x000\x000\x000\x00\n', '\x00(\x007\x000\x006\x00)\x00 \x008\x006\x002\x00-\x002\x001\x002\x008\x00\n', '\x00(\x008\x006\x003\x00)\x00 \x007\x006\x003\x00-\x008\x006\x003\x002\x00\n', '\x00(\x006\x008\x002\x00)\x00 \x004\x000\x004\x00-\x000\x000\x005\x001\x00\n', '\x00(\x007\x003\x004\x00)\x00 \x006\x006\x007\x00-\x002\x008\x007\x007\x00']

I bring this up because when I then write the items from pl to the target file, instead of a list of phone numbers in a new text file I get this:

�3�9�2�-�3�1�1�5��(�2�1�4�)� �9�4�1�-�3�8�4�1��(�3�0�4�)� �2�1�6�-�2�0�9�6��(�7�2�4�)� �3�3�7�-�3�5�0�4��(�2�4�8�)� �3�7�0�-�0�0�0�0��(�7�0�6�)� �8�6�2�-�2�1�2�8��(�8�6�3�)� �7�6�3�-�8�6�3�2��(�6�8�2�)� �4�0�4�-�0�0�5�1��(�7�3�4�)� �6�6�7�-�2�8�7�7�

No new lines. Spaces between items instead.

Upvotes: 0

Views: 500

Answers (1)

snakecharmerb

Reputation: 55620

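Your original file is encoded as UTF-16, big endian, but you are reading it as UTF-8. Each null byte is therefore decoded as a '\x00' character. print() appears fine only because most terminals do not display null characters, while the list repr shows them escaped.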

>>> bs = b'\x00(\x006\x001\x000\x00)\x00 \x003\x009\x002\x00-\x003\x001\x001\x005\x00\n'
>>> bs.decode('utf-8')
'\x00(\x006\x001\x000\x00)\x00 \x003\x009\x002\x00-\x003\x001\x001\x005\x00\n'
>>> bs.decode('utf-16-be')
'(610) 392-3115\n'

(The presence of a null byte b'\x00' before each ASCII character is a strong hint that the encoding is UTF-16, big endian.)
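The null-byte hint can be turned into a rough check. This is only a heuristic sketch: the function name guess_utf16_endianness and the 0.4 threshold are made up for illustration, it ignores byte order marks, and it only works for text that is mostly ASCII.

```python
def guess_utf16_endianness(raw, threshold=0.4):
    """Guess UTF-16 byte order from where the null bytes fall.

    Hypothetical helper: returns "utf-16-be", "utf-16-le", or None.
    Mostly-ASCII text in UTF-16 has a null byte in every 2-byte unit:
    first for big endian, second for little endian.
    """
    units = len(raw) // 2
    if units == 0:
        return None
    even_nulls = raw[0::2].count(b"\x00")  # nulls at offsets 0, 2, 4, ...
    odd_nulls = raw[1::2].count(b"\x00")   # nulls at offsets 1, 3, 5, ...
    if even_nulls / units >= threshold and even_nulls > odd_nulls:
        return "utf-16-be"
    if odd_nulls / units >= threshold and odd_nulls > even_nulls:
        return "utf-16-le"
    return None


# Read the file in binary mode so no decoding happens before the check:
# with open("original.txt", "rb") as f:
#     print(guess_utf16_endianness(f.read()))
```

If the result is None, the file is probably not UTF-16, or not mostly ASCII, and a different approach is needed.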

Opening the file like this ought to work:

original_file = open("original.txt", "r", encoding="utf-16-be")
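Putting it together, here is a minimal sketch of the whole read-then-write step, assuming the source really is UTF-16 big endian without a byte order mark. The function name copy_utf16_be_lines is hypothetical. Line endings are stripped on read and added back explicitly on write, so every number ends up on its own line.

```python
def copy_utf16_be_lines(src_path, dst_path):
    """Read UTF-16-BE text lines and write them back out as UTF-8.

    Hypothetical helper; returns the list of lines without line endings.
    """
    # Decode with the encoding the file actually uses.
    with open(src_path, "r", encoding="utf-16-be") as src:
        lines = [line.rstrip("\n") for line in src]

    # Write each item followed by an explicit newline.
    with open(dst_path, "w", encoding="utf-8") as dst:
        for line in lines:
            dst.write(line + "\n")

    return lines


# Usage, with the file names from the question:
# pl = copy_utf16_be_lines("original.txt", "target.txt")
```

If the file does start with a byte order mark, opening it with encoding="utf-16" instead lets Python detect the byte order and drop the mark.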

Upvotes: 1
