soft encoder

Reputation: 11

How can I ignore non-existent columns in pandas' read_parquet function?

I am trying to read Parquet files with pandas, but a few of the columns do not exist in some of the files.

I would like to know how to skip the column existence check in the read_parquet function.

def column_data(self):
    """Drop columns with too few unique values (near-constant columns).

    Returns:
        list: the names of the remaining columns
    """
    # get the DataFrame for this platform, equipment and date
    self._df_list_data = access_data.read_data(self.platform_id, self.equipment_id, self.date_range_df[0].date())

    # iterate over a copy of the column index, since columns are dropped inside the loop
    for col in list(self._df_list_data.columns):
        if self._df_list_data[col].nunique() <= 5:  # 5 or fewer distinct values: drop
            self._df_list_data.drop(col, axis=1, inplace=True)

    return list(self._df_list_data.columns)
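As a toy illustration of that filtering (a sketch with a small in-memory frame standing in for the access_data result):

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6],                          # 6 unique values: kept
    "status": ["on", "on", "on", "on", "on", "off"],   # 2 unique values: dropped
})

# drop near-constant columns, iterating over a copy of the column list
for col in list(df.columns):
    if df[col].nunique() <= 5:
        df.drop(col, axis=1, inplace=True)

print(list(df.columns))  # ['id']
```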

I read the parquet file with all of its columns and then store the column names in a list:

if columns is None:
    data_list = pd.read_parquet(io.BytesIO(blob_data.readall()), engine='fastparquet')  # no column list given: read every column
else:
    data_list = pd.read_parquet(io.BytesIO(blob_data.readall()), columns=columns, engine='fastparquet')  # read only the requested columns
return data_list

The first branch fetches the data with all of its columns; column_data() then stores that column list.

The second branch reads the parquet file for a given date using the stored column list, but for a few dates some of those columns do not exist.

As a result, pandas cannot match the requested columns and raises an error.

How can I ignore the column existence check in pandas' read_parquet function?

Upvotes: 1

Views: 1219

Answers (2)

mdurant

Reputation: 28684

Fastparquet already supports simple "schema evolution" automatically:

In [3]: df1 = pd.DataFrame({"a": [1, 2, 3], "b": ["oi", "hi", "ho"]})

In [4]: df2 = pd.DataFrame({"a": [1, 2, 3]})

In [5]: mkdir parqs

In [7]: df1.to_parquet("parqs/1.parquet", engine="fastparquet")

In [8]: df2.to_parquet("parqs/2.parquet", engine="fastparquet")

In [9]: pd.read_parquet("parqs", engine="fastparquet")
Out[9]:
   a     b
0  1    oi
1  2    hi
2  3    ho
3  1  None
4  2  None
5  3  None

Or you can supply columns= to select only the columns you want, or dtypes= to specify the schema you expect.

Upvotes: 1

Ada

Reputation: 1913

As far as I know there is no option in pandas to skip the column existence check. As a workaround, you can wrap your code in a try/except block and handle the missing-column error there. One option is to specify default values that you can put into the missing columns:

import pandas as pd
import io

default_values = {
    # Replace 'column1' and 'column2' with your actual column names
    'column1': 0,
    'column2': 'N/A'
}

raw = blob_data.readall()  # read the blob once; the stream can only be consumed once

try:
    data_list = pd.read_parquet(io.BytesIO(raw), columns=columns, engine='fastparquet')
except ValueError as e:
    print(f"Error reading Parquet file: {e}")

    # Fall back to reading every column, then fill in the missing ones
    data_list = pd.read_parquet(io.BytesIO(raw), engine='fastparquet')
    for column, default_value in default_values.items():
        if column not in data_list.columns:
            data_list[column] = default_value
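A related pandas-only shortcut (a sketch with an illustrative in-memory frame): DataFrame.reindex adds any missing columns in a single call, filled with NaN by default:

```python
import pandas as pd

# Suppose the file only contained column "a", but we expected "a" and "b"
data_list = pd.DataFrame({"a": [1, 2, 3]})
wanted = ["a", "b"]

# reindex keeps the existing columns and creates the missing ones, filled with NaN
data_list = data_list.reindex(columns=wanted)
print(list(data_list.columns))  # ['a', 'b']
```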

You could also simply skip the requested columns that do not exist in the file:

raw = blob_data.readall()

try:
    data_list = pd.read_parquet(io.BytesIO(raw), columns=columns, engine='fastparquet')
except ValueError as e:
    print(f"Error reading Parquet file: {e}")

    # Read everything, then keep only the requested columns that actually exist
    data_list = pd.read_parquet(io.BytesIO(raw), engine='fastparquet')
    missing = [column for column in columns if column not in data_list.columns]
    print(f"Skipping missing columns: {missing}")
    data_list = data_list[[column for column in columns if column in data_list.columns]]

Upvotes: 2
