raphael75

Reputation: 3218

pandas data types changed when reading from parquet file?

I am brand new to pandas and the parquet file type. I have a Python script that:

  1. reads in an hdfs parquet file
  2. converts it to a pandas dataframe
  3. loops through specific columns and changes some values
  4. writes the dataframe back to a parquet file

Then the parquet file is imported back into hdfs using impala-shell.

The issue I'm having appears to be with step 2. I have the script print out the contents of the dataframe immediately after it reads the file in and before any changes are made in step 3. It appears to be changing the datatypes and the data of some fields, which causes problems when it writes the dataframe back to a parquet file. For example, fields that show up as NULL in the database are replaced with the string "None" (for string columns) or the string "nan" (for numeric columns) in the printout of the dataframe, and integer columns are printed as floats.

It appears that it is actually changing these values, because when it writes the parquet file and I import it into hdfs and run a query, I get errors like this:

WARNINGS: File '<path>/test.parquet' has an incompatible Parquet schema for column 
'<database>.<table>.tport'. Column type: INT, Parquet schema:
optional double tport [i:1 d:1 r:0]

I don't know why it would alter the data and not just leave it as-is. If this is what's happening, I don't know if I need to loop over every column and replace all these back to their original values, or if there is some other way to tell it to leave them alone.

I have been using this reference page: http://arrow.apache.org/docs/python/parquet.html

It uses

pq.read_table(in_file) 

to read the parquet file and then

df = table2.to_pandas()

to convert to a dataframe that I can loop through and change the columns. I don't understand why it's changing the data, and I can't find a way to prevent this from happening. Is there a different way I need to read it than read_table?
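
To see at which step the types change, one thing that can be compared is the Arrow table's schema against the dataframe's dtypes (a minimal check; the exact types printed depend on how the original file was written):

table2 = pq.read_table(in_file)
print(table2.schema)   # column types as pyarrow read them from the parquet file
df = table2.to_pandas()
print(df.dtypes)       # column types pandas ends up with after the conversion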

If I query the database, the data looks like this:

tport
0
1

My print(df) line for the same thing looks like this:

tport
0.00000
nan
nan
1.00000
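
For what it's worth, a plain pandas Series built by hand seems to print the same way when some of the values are missing (a minimal check, nothing to do with the parquet file itself):

s = pd.Series([0, None, None, 1])
print(s.dtype)   # float64
print(s)         # 0.0, NaN, NaN, 1.0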

Here is the relevant code. I left out the part that processes the command-line arguments since it was long and it doesn't apply to this problem. The file passed in is in_file:

import sys, getopt
import random
import re
import math

import pyarrow.parquet as pq
import numpy as np
import pandas as pd
import pyarrow as pa
import os.path

# <CLI PROCESSING SECTION HERE>

# GET LIST OF COLUMNS THAT MUST BE SCRAMBLED
with open('scrambler_columns.txt', 'r') as field_file:
    contents = field_file.read()
scrambler_columns = contents.split('\n')

def scramble_str(xstr):
    #print(xstr + '_scrambled!')
    return xstr + '_scrambled!'

parquet_file = pq.ParquetFile(in_file)
table2 = pq.read_table(in_file)
metadata = pq.read_metadata(in_file)

df = table2.to_pandas() #dataframe

print('rows: ' + str(df.shape[0]))
print('cols: ' + str(df.shape[1]))

pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)  # -1 is deprecated; None means no limit
pd.set_option('display.float_format', lambda x: '%.5f' % x)

#df.fillna(value='', inplace=True) # np.nan # \xa0 

print(df) # print before making any changes


cols = list(df)
# https://pythonbasics.org/pandas-iterate-dataframe/
for col_name, col_data in df.iteritems():
    #print(cols[index])
    if col_name in scrambler_columns:
        print('scrambling values in column '  + col_name)

        for i, val in col_data.items():
            df.at[i, col_name] = scramble_str(str(val))

        
    
print(df) # print after making changes

print(parquet_file.num_row_groups)
print(parquet_file.read_row_group(0))

# WRITE NEW PARQUET FILE
new_table = pa.Table.from_pandas(df)
writer = pq.ParquetWriter(out_file, new_table.schema)
writer.write_table(new_table)
writer.close()
    
if os.path.isfile(out_file):
    print('wrote ' + out_file)
else:
    print('error writing file ' + out_file)

# READ NEW PARQUET FILE
table3 = pq.read_table(out_file)
df = table3.to_pandas() #dataframe
print(df)

EDIT: Here are the datatypes for the first few columns in hdfs (screenshot of the table schema omitted):

and here are the same ones that are in the pandas dataframe:

id         object 
col1       float64
col2       object 
col3       object 
col4       float64
col5       object 
col6       object 
col7       object 

It appears to convert

String to object
Int    to float64
bigint to float64

How can I tell pandas what data types the columns should be?

Edit 2: I was able to find a workaround by directly processing the pyarrow tables. Please see my question and answers here: How to update data in pyarrow table?
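
The idea is roughly the following (a simplified sketch of that approach, not the exact code from the linked answer; the column name is a placeholder):

table = pq.read_table(in_file)

# pull one column out, transform its values in plain Python,
# then put it back into the table without ever going through pandas
col_name = 'some_string_column'   # placeholder name
values = table.column(col_name).to_pylist()
scrambled = pa.array([v + '_scrambled!' if v is not None else None for v in values])
idx = table.schema.get_field_index(col_name)
table = table.set_column(idx, col_name, scrambled)

pq.write_table(table, out_file)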

Upvotes: 5

Views: 15512

Answers (1)

Pace

Reputation: 43787

fields that show up as NULL in the database are replaced with the string "None" (for string columns) or the string "nan" (for numeric columns) in the printout of the dataframe.

This is expected. It's just how pandas' print function renders missing values.

It appears to convert String to object

This is also expected. Numpy/pandas does not have a dtype for variable-length strings. It's possible to use a fixed-length string type, but that would be pretty unusual.
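
A quick way to see this, independent of parquet (a minimal check):

s = pd.Series(['a', 'b', None])
print(s.dtype)  # object -- pandas stores the strings as generic Python objects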

It appears to convert Int to float64

This is also expected since the column has nulls and numpy's int64 is not nullable. If you would like to use pandas' nullable integer type, you can do...

def lookup(t):
  # t is the pyarrow DataType of each column; returning a pandas dtype here
  # overrides the default conversion, and returning None keeps the default
  if pa.types.is_integer(t):
    return pd.Int64Dtype()

df = table.to_pandas(types_mapper=lookup)

Of course, you could create a more fine-grained lookup if you wanted to use both Int32Dtype and Int64Dtype; this is just a template to get you started.
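
For instance, a more fine-grained mapper might look something like this (a sketch along the same lines, assuming you want to keep 32-bit and 64-bit columns distinct):

def lookup(t):
  # keep the integer width from the file's schema, but use pandas'
  # nullable extension types so nulls don't force a float64 conversion
  if pa.types.is_int32(t):
    return pd.Int32Dtype()
  if pa.types.is_int64(t):
    return pd.Int64Dtype()
  # returning None (implicitly) falls back to the default conversion

df = table.to_pandas(types_mapper=lookup)
print(df.dtypes)  # integer columns should now show Int32/Int64 instead of float64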

Upvotes: 4
