Reputation: 691
I have a .csv file with 4 columns: 2 integer columns, 1 byte array column and a date column. This byte array column has a binary text that I need do decode to a normal utf-8 string.
Here how my .csv looks like:
id1 id2 text date
1 2 0x202020312045584D4F2841292E205 2020-01-01
3 4 0x20312020455843454C454E 2020-05-01
When I simply use pd.read_csv():
df = pd.read_csv(file_path + file_name)
output:
id1 id2 text date
24228 35649098 0x202020312045584D4F2841292E2 2020-05-04
24298 97780137 0x20312020455843454C454E54C38 2020-05-04
df.info():
id1 994 non-null int64
id2 994 non-null int64
text 994 non-null object
date 994 non-null object
However, I need the normal string, so I tried decoding only this column, but I can't make it work. Here is what I have already tried:
Trial 1:
df.loc[:,'transformedText'] = df.text.str.decode('utf-8')
output: transformedText column comes all as NaN
Trial 2:
df.loc[:,'transformedText'] = df.text.str.encode('utf-8').str.decode('utf-8')
output: transformedText column keeps the byte array string
Trial 3:
df.loc[:,'transformedText'] = df.text.str.encode('ascii').str.decode('utf-8')
output: transformedText column keeps the byte array string
In order to investigate the problem more, I checked what happend when I just encoded the string: df.loc[:,'transformedText'] = df.text.str.encode('ascii')
Output: All it does is add a b' ' on my string (e.g b'0x202020312045584D4F2841292E2')
I believe the reason the decoding doesn't work is because read_csv is not recognizing my column as a byte array column, but as a string column. Although, I am not sure about this.
The ouput that I need is:
id1 id2 text date
24228 35649098 A normal string that a human can read 1 2020-05-04
24298 97780137 A normal string that a human can read 2 2020-05-04
Also, I am kind of new with the binary files, so anything helps!
I have already checked these links bellow, but couldn't find an answer:
https://www.programiz.com/python-programming/methods/string/encode
https://www.geeksforgeeks.org/convert-binary-to-string-using-python/
https://www.kite.com/python/answers/how-to-convert-binary-to-string-in-python
Convert string to binary in python
Convert bytes to a string
Upvotes: 2
Views: 2802
Reputation: 9701
The aim of this answer is to (hopefully) fill in potential the gaps of understanding as to why your attempts and obvious solid effort did not result in a solution.
The current excellent answers by @PierreD and @Ralubrusto explain the solution nicely.
The question suggests you were working (and researching) how to decode the 'binary' string. The catch is, neither string was a binary string, but rather a hexidecimal (hex) string. Hence the title - you were unfortunately looking in the wrong place.
What's the difference? A hex string is a means to represent binary data.
0
, and 1
0-9
, and A-F
. Which can be 'counted' as 0, 1, ..., 8, 9, A, B, ..., E, F
.An 8-bit binary code is just that, eight characters (01001001
). Whereas the equivalent 8-bit hex code (49
) is two characters. One of the hex strings you provided was 28 characters. If this was converted to binary, it would have been 112 characters! So the hex system is a means to 'shorten' a binary string for easier representation.
This short page provides a simple comparison of the two data types.
The purpose of this example is to show why the .encode()
and .decode()
functions only did part of the job.
It's easy to convert a hex string into an ASCII string by hand using the following steps:
This example shows the transformation by hand:
0x49 (hex) --> 0100 1001 (binary) --> 73 (decimal) --> I (ascii)
# .encode() stops here ^^^
A binary string b'mystring'
or b'49'
and a hex string '0x49'
are different things; where the hex string can be a representation of a binary value.
I hope this helps to explain a bit behind why your attempts unfortunately did not work - despite the solid effort.
Upvotes: 2
Reputation: 1501
It seems that there is some inconsistency between the data in your CSV and the data pandas
has read. This is the first line you shared from your file:
1 2 0x202020312045584D4F2841292E205 2020-01-01
Notice that the string is encoded as hexadecimal (starting with 0x
) and, as so, it needs an even number of digits to be correctly decoded. The example above has 29 digits (not including the 0x
) which means it can't be correctly decoded as it is.
I noticed, however, that there might be a problem while loading your code (or it was just a mistyping). Both rows of binary strings start with the same figures before and after the read_csv
, but they have different final figures. See the first one:
0x202020312045584D4F2841292E205 # Before
0x202020312045584D4F2841292E2 # After
And the second one:
0x20312020455843454C454E # Before
0x20312020455843454C454E54C38 # After
Also, the dates and the id
columns are different before and after loading into pandas. It would be interesting to check if pandas is loading your data correctly.
Anyway, if you have hex strings of data, you can decode them by doing:
df['Decoded'] = df['text'].str[2:].apply(lambda s: bytes.fromhex(s).decode('utf-8')
If pandas read this column as numbers instead, you can do:
df['Decoded'] = df['text'].apply(lambda s: bytes.fromhex(hex(s)[2:]).decode('utf-8')
It've succeded decoding your texts for the following binary strings:
0x202020312045584D4F2841292E20 # First row without the last digit
0x20312020455843454C454E # Second row
Which returns:
0 1 EXMO(A).
1 1 EXCELEN
Both are "almost" words in portuguese: EXMO(A)
is a short for Excelentíssimo(a), while EXCELEN
is an incomplete word, that can be Excelência, Excelente or something similar (I am brazilian as well, so it was nice to see some non-english words beeing decoded).
As you can see, there is some problem with your data, but we could somehow decode parts of it. Let us know if you have any further issues and if you succeeded decoding your texts after all.
Upvotes: 1
Reputation: 26211
Make sure you actually have proper hex strings (some of your examples have odd numbers of hex digits, which will make the code below fail).
def hextostr(v):
if v.startswith('0x'): v = v[2:]
b = bytes.fromhex(v.lower())
return b.decode('utf-8')
Then:
df['text'] = df['text'].apply(hextostr)
Upvotes: 1