Reputation: 691

Pandas: How to read bytes and non-bytes columns from CSV and decode the bytes column?

I have a .csv file with 4 columns: 2 integer columns, 1 byte array column and a date column. This byte array column has a binary text that I need do decode to a normal utf-8 string.

Here how my .csv looks like:

id1   id2   text                                       date
 1     2    0x202020312045584D4F2841292E205           2020-01-01
 3     4    0x20312020455843454C454E                  2020-05-01

When I simply use pd.read_csv():

df = pd.read_csv(file_path + file_name)

output:

id1       id2              text                   date
24228   35649098    0x202020312045584D4F2841292E2   2020-05-04
24298   97780137    0x20312020455843454C454E54C38   2020-05-04

df.info():

id1     994 non-null int64
id2     994 non-null int64
text    994 non-null object
date    994 non-null object

However, I need the normal string, so I tried decoding only this column, but I can't make it work. Here is what I have already tried:

Trial 1:

df.loc[:,'transformedText'] = df.text.str.decode('utf-8')

output: transformedText column comes all as NaN

Trial 2:

df.loc[:,'transformedText'] = df.text.str.encode('utf-8').str.decode('utf-8')

output: transformedText column keeps the byte array string

Trial 3:

df.loc[:,'transformedText'] = df.text.str.encode('ascii').str.decode('utf-8')

output: transformedText column keeps the byte array string

In order to investigate the problem more, I checked what happend when I just encoded the string: df.loc[:,'transformedText'] = df.text.str.encode('ascii')

Output: All it does is add a b' ' on my string (e.g b'0x202020312045584D4F2841292E2')

I believe the reason the decoding doesn't work is because read_csv is not recognizing my column as a byte array column, but as a string column. Although, I am not sure about this.

The ouput that I need is:

id1       id2              text                                date
24228   35649098    A normal string that a human can read 1  2020-05-04
24298   97780137    A normal string that a human can read 2  2020-05-04

Also, I am kind of new with the binary files, so anything helps!

I have already checked these links bellow, but couldn't find an answer:
https://www.programiz.com/python-programming/methods/string/encode
https://www.geeksforgeeks.org/convert-binary-to-string-using-python/
https://www.kite.com/python/answers/how-to-convert-binary-to-string-in-python
Convert string to binary in python
Convert bytes to a string

Upvotes: 2

Answers (3)

s3dev

Reputation: 9701

The aim of this answer is to (hopefully) fill in potential the gaps of understanding as to why your attempts and obvious solid effort did not result in a solution.

The current excellent answers by @PierreD and @Ralubrusto explain the solution nicely.

Looking in the Jungle for a Shark:

The question suggests you were working (and researching) how to decode the 'binary' string. The catch is, neither string was a binary string, but rather a hexidecimal (hex) string. Hence the title - you were unfortunately looking in the wrong place.

What's the difference? A hex string is a means to represent binary data.

Binary: A numeric system consisting of two symbols, 0, and 1
Hexidecimal: A numeric system consisting of 16 symbols (hex (6) + deci (10)), 0-9, and A-F. Which can be 'counted' as 0, 1, ..., 8, 9, A, B, ..., E, F.

An 8-bit binary code is just that, eight characters (01001001). Whereas the equivalent 8-bit hex code (49) is two characters. One of the hex strings you provided was 28 characters. If this was converted to binary, it would have been 112 characters! So the hex system is a means to 'shorten' a binary string for easier representation.

This short page provides a simple comparison of the two data types.

Hex to ASCII:

The purpose of this example is to show why the .encode() and .decode() functions only did part of the job.

It's easy to convert a hex string into an ASCII string by hand using the following steps:

Convert hex to binary (the encode function stops here)
Convert binary to decimal
Lookup the decimal value to a character via the ASCII table

This example shows the transformation by hand:

0x49 (hex) --> 0100 1001 (binary) --> 73 (decimal) --> I (ascii)

    # .encode() stops here ^^^

Summary:

A binary string b'mystring' or b'49' and a hex string '0x49' are different things; where the hex string can be a representation of a binary value.

I hope this helps to explain a bit behind why your attempts unfortunately did not work - despite the solid effort.

Upvotes: 2

Ralubrusto

Reputation: 1501

It seems that there is some inconsistency between the data in your CSV and the data pandas has read. This is the first line you shared from your file:

 1     2    0x202020312045584D4F2841292E205           2020-01-01

Notice that the string is encoded as hexadecimal (starting with 0x) and, as so, it needs an even number of digits to be correctly decoded. The example above has 29 digits (not including the 0x) which means it can't be correctly decoded as it is.

I noticed, however, that there might be a problem while loading your code (or it was just a mistyping). Both rows of binary strings start with the same figures before and after the read_csv, but they have different final figures. See the first one:

0x202020312045584D4F2841292E205  # Before
0x202020312045584D4F2841292E2    # After

And the second one:

0x20312020455843454C454E       # Before
0x20312020455843454C454E54C38  # After

Also, the dates and the id columns are different before and after loading into pandas. It would be interesting to check if pandas is loading your data correctly.

Anyway, if you have hex strings of data, you can decode them by doing:

df['Decoded'] = df['text'].str[2:].apply(lambda s: bytes.fromhex(s).decode('utf-8')

If pandas read this column as numbers instead, you can do:

df['Decoded'] = df['text'].apply(lambda s: bytes.fromhex(hex(s)[2:]).decode('utf-8')

It've succeded decoding your texts for the following binary strings:

0x202020312045584D4F2841292E20 # First row without the last digit
0x20312020455843454C454E  # Second row

Which returns:

0       1 EXMO(A).
1        1  EXCELEN

Both are "almost" words in portuguese: EXMO(A) is a short for Excelentíssimo(a), while EXCELEN is an incomplete word, that can be Excelência, Excelente or something similar (I am brazilian as well, so it was nice to see some non-english words beeing decoded).

As you can see, there is some problem with your data, but we could somehow decode parts of it. Let us know if you have any further issues and if you succeeded decoding your texts after all.

Upvotes: 1

Pierre D

Reputation: 26211

Make sure you actually have proper hex strings (some of your examples have odd numbers of hex digits, which will make the code below fail).

def hextostr(v):
    if v.startswith('0x'): v = v[2:]
    b = bytes.fromhex(v.lower())
    return b.decode('utf-8')

Then:

df['text'] = df['text'].apply(hextostr)