Reputation: 13
I am reading and manipulating a file with:
base_file = open(path+'/'+base_name, "r")
lines = base_file.readlines()
After this, I search for the line that starts with "raw_data":
import re
for line in lines:
    # Raw string so that \s reaches the regex engine unchanged
    if re.match(r"\s{0,100}raw_data: ", line):
        split_line = line.split("raw_data:")
        print(split_line)
        raw_string = split_line[1]
One example of raw_data is:
raw_data: "&\276!\300\307 =\277\"O\271\277vH9?j?\345?@\243\264=\350\034\345\277\260\345\033\300\023\017(@z|\273\277L\}\277\210\\031\300\213\263z\277\302\241\033\300\000\207\323\277\247Oh>j\354\215@\364\305\201\276\361+\202@t:\304\277\344\231\243@\225k\002\300vw\262\277\362\220j\300\"(\337\276\354b8\300\230\347H\300\201\320\204\300S;N\300Z0G\300>j\210\000@\034\014\220@\231\330J@\223\025\236@\006\332\230\276\227\273\n\277\353@,@\202\205\215\277\340\356\022\300/\223\035\277\331\277\362\276a\350\013@)\353\276\277v6\316\277K\326\207\300`2)\300\004\014Q\300\340\267\271\300MV\305\300\327\010\207\300j\346o\300\377\260\216\300[\332g\300\336\266\003\300\320S\272?6\300Y@\356\250\034\300\367\277&\300\335Uq>o\010&\300r\277\252\300U\314\243\300\253d\377\300"
And raw_string will be:
print(raw_string)
"&\276!\300\307 =\277\"O\271\277vH9?j?\345?@\243\264=\350\034\345\277\260\345\033\300\023\017(@z|\273\277L\}\277\210\\031\300\213\263z\277\302\241\033\300\000\207\323\277\247Oh>j\354\215@\364\305\201\276\361+\202@t:\304\277\344\231\243@\225k\002\300vw\262\277\362\220j\300\"(\337\276\354b8\300\230\347H\300\201\320\204\300S;N\300Z0G\300>j\210\000@\034\014\220@\231\330J@\223\025\236@\006\332\230\276\227\273\n\277\353@,@\202\205\215\277\340\356\022\300/\223\035\277\331\277\362\276a\350\013@)\353\276\277v6\316\277K\326\207\300`2)\300\004\014Q\300\340\267\271\300MV\305\300\327\010\207\300j\346o\300\377\260\216\300[\332g\300\336\266\003\300\320S\272?6\300Y@\356\250\034\300\367\277&\300\335Uq>o\010&\300r\277\252\300U\314\243\300\253d\377\300"
If I read this text from the file, every character of the text becomes one character of the string, even inside the escape sequences. So my question is: how do I transform this plain text into a UTF-8 string so that \300 is read as one character and not four?
I tried passing encoding="utf-8" to open(), but it does not work.
I ran the same example with the raw data assigned to a variable in the code, and it works properly.
RAW_DATA = "&\276!\300\307 =\277\"O\271\277vH9?j?\345?@\243\264=\350\034\345\277\260\345\033\300\023\017(@z|\273\277L\\}\277\210\\\031\300\213\263z\277\302\241\033\300\000\207\323\277\247Oh>j\354\215@\364\305\201\276\361+\202@t:\304\277\344\231\243@\225k\002\300vw\262\277\362\220j\300\"(\337\276\354b8\300\230\347H\300\201\320\204\300S;N\300Z0G\300<I>>j\210\000@\034\014\220@\231\330J@\223\025\236@\006\332\230\276\227\273\n\277\353@,@\202\205\215\277\340\356\022\300/\223\035\277\331\277\362\276a\350\013@)\353\276\277v6\316\277K\326\207\300`2)\300\004\014Q\300\340\267\271\300MV\305\300\327\010\207\300j\346o\300\377\260\216\300[\332g\300\336\266\003\300\320S\272?6\300Y@\356\250\034\300\367\277&\300\335Uq>o\010&\300r\277\252\300U\314\243\300\253d\377\300"
print(f"Qnt -> {len(RAW_DATA)}") # Qnt -> 256
print(type(RAW_DATA))
at = 0
total = 0
while at < len(RAW_DATA):
    fin = at+4
    substrin = RAW_DATA[at:fin]
    resu = FourString_float(substrin)
    at = fin
In this example, \300 is only one character.
Hope someone can help me.
Upvotes: 0
Views: 214
Reputation: 479
The problem is that when the file is read, the escape \ symbols come in as literal \ characters, but in the example you've provided they are evaluated together with the digits that follow them, i.e. \276 is read as a single character.
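To make the difference concrete, here is a minimal sketch comparing the same four characters written as a regular literal and as a raw literal (the raw literal stands in for what a file read returns; the variable names are only illustrative):

```python
# Regular literal: Python's parser turns the octal escape into ONE character.
as_literal = "\300"

# Raw literal: the backslash and digits stay as FOUR separate characters,
# which is what you get when the same text is read from a file.
as_file_text = r"\300"

print(len(as_literal))    # 1
print(len(as_file_text))  # 4
print(ord(as_literal))    # 192, i.e. octal 300
```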
If you run:
RAW_DATA = r"&\276!\300\307 =\277\"O\271\277vH9?j?\345?@\243\264=\350\034\345\277\260\345\033\300\023\017(@z|\273\277L\\}\277\210\\\031\300\213\263z\277\302\241\033\300\000\207\323\277\247Oh>j\354\215@\364\305\201\276\361+\202@t:\304\277\344\231\243@\225k\002\300vw\262\277\362\220j\300\"(\337\276\354b8\300\230\347H\300\201\320\204\300S;N\300Z0G\300<I>>j\210\000@\034\014\220@\231\330J@\223\025\236@\006\332\230\276\227\273\n\277\353@,@\202\205\215\277\340\356\022\300/\223\035\277\331\277\362\276a\350\013@)\353\276\277v6\316\277K\326\207\300`2)\300\004\014Q\300\340\267\271\300MV\305\300\327\010\207\300j\346o\300\377\260\216\300[\332g\300\336\266\003\300\320S\272?6\300Y@\356\250\034\300\367\277&\300\335Uq>o\010&\300r\277\252\300U\314\243\300\253d\377\300"
print(f"Qnt -> {len(RAW_DATA)}") # Qnt -> 256
print(type(RAW_DATA))
at = 0
total = 0
while at < len(RAW_DATA):
    fin = at+4
    substrin = RAW_DATA[at:fin]
    resu = FourString_float(substrin)
    at = fin
You should see the same problem that you were getting originally. Notice that we are using a raw string literal instead of a regular string literal, which ensures that the \ characters are not processed as escapes.
You need to evaluate RAW_DATA to force the \ escapes to be processed.
You can do something like RAW_DATA = eval(f'"{RAW_DATA}"')
or
import ast
RAW_DATA = ast.literal_eval(f'"{RAW_DATA}"')
Note that the second option is more secure than a straight eval, as it limits what can be executed to Python literals.
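As a rough sketch of how this could be applied to a line read from the file: the text after "raw_data:" already carries its own surrounding quotes, so in that case it may be enough to strip whitespace rather than wrap it in quotes again. The helper name parse_raw_data is hypothetical, prefixing the literal with b to get bytes directly is a variation on the approach above, and the little-endian float32 layout is an assumption about the data, not something the question confirms.

```python
import ast
import struct

def parse_raw_data(line: str) -> bytes:
    """Hypothetical helper: turn a text line such as  raw_data: "\\276..."  into bytes."""
    # Keep only the quoted part after the key; it already includes the quotes.
    quoted = line.split("raw_data:", 1)[1].strip()
    # Prefixing with b makes literal_eval parse it as a bytes literal,
    # so each octal escape becomes one byte.
    return ast.literal_eval("b" + quoted)

# A raw string stands in for a line read from the file (short made-up sample).
line = r'raw_data: "\000\000\200?\000\000\000@"'
data = parse_raw_data(line)
print(len(data))  # 8

# Assumption: the bytes are packed little-endian 32-bit floats.
floats = struct.unpack(f"<{len(data) // 4}f", data)
print(floats)  # (1.0, 2.0)
```

If you prefer to keep a str, as in the question's code, ast.literal_eval(quoted) without the b prefix returns a string with one character per escape, and .encode("latin-1") converts it to the same bytes.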
Upvotes: 2