Reputation: 11
I am trying to read Parquet files through pandas, but a few columns do not exist in some of the files.
I would like to know how to skip the column existence check in the read_parquet function.
def column_data(self):
    """Drop near-constant columns (those with 5 or fewer unique values).

    Returns:
        list: the names of the remaining columns
    """
    # Read the data for the first date in the range
    self._df_list_data = access_data.read_data(self.platform_id, self.equipment_id, self.date_range_df[0].date())
    # Iterate over a copy of the column index, since columns are dropped in place
    for col in list(self._df_list_data.columns):
        if len(self._df_list_data.loc[:, col].unique()) <= 5:  # drop columns with 5 or fewer distinct values
            self._df_list_data.drop(col, axis=1, inplace=True)
    column_data = list(self._df_list_data.columns.values)
    return column_data
I read all of the columns from the Parquet file and then store them in a list.
if columns is None:
    data_list = pd.read_parquet(io.BytesIO(blob_data.readall()), engine='fastparquet')  # read every column in the file
else:
    data_list = pd.read_parquet(io.BytesIO(blob_data.readall()), columns=columns, engine='fastparquet')  # read only the requested columns
return data_list
The first branch reads all of the data, and column_data() stores the resulting column names.
The second branch reads the Parquet file for a given date, but for a few dates some of those columns do not exist, so the stored column list no longer matches the file.
How can I skip the column existence check in pandas' read_parquet function?
Upvotes: 1
Views: 1219
Reputation: 28684
Fastparquet already supports simple "schema evolution" automatically:
In [3]: df1 = pd.DataFrame({"a": [1, 2, 3], "b": ["oi", "hi", "ho"]})
In [4]: df2 = pd.DataFrame({"a": [1, 2, 3]})
In [5]: mkdir parqs
In [7]: df1.to_parquet("parqs/1.parquet", engine="fastparquet")
In [8]: df2.to_parquet("parqs/2.parquet", engine="fastparquet")
In [9]: pd.read_parquet("parqs", engine="fastparquet")
Out[9]:
   a     b
0  1    oi
1  2    hi
2  3    ho
3  1  None
4  2  None
5  3  None
Alternatively, you can supply columns= to select only the columns you want, or dtypes= to specify the schema you expect.
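For example, a minimal sketch of the columns= option, reusing the parqs directory written in the session above:

import pandas as pd

# Requesting only the column present in every file sidesteps the mismatch entirely
df = pd.read_parquet("parqs", columns=["a"], engine="fastparquet")
print(df)

Since "a" exists in both files, no existence check can fail.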
Upvotes: 1
Reputation: 1913
As far as I know there is no option to ignore the column existence check in pandas. What you could do to work around it is wrap your code in a try/except block and deal with the missing-columns error there. One option would be to specify default values that you can put into the missing columns:
import pandas as pd
import io

raw = blob_data.readall()  # read the blob once so it can be parsed twice if needed
default_values = {
    # Replace 'column1' and 'column2' with the actual column names
    'column1': 0,
    'column2': 'N/A'
}
try:
    data_list = pd.read_parquet(io.BytesIO(raw), columns=columns, engine='fastparquet')
except ValueError as e:
    print(f"Error reading Parquet file: {e}")
    # Fall back to reading every column, then fill in the missing ones with defaults
    data_list = pd.read_parquet(io.BytesIO(raw), engine='fastparquet')
    for column, default_value in default_values.items():
        if column not in data_list.columns:
            data_list[column] = default_value
You could also just drop the columns that cause the problem:
raw = blob_data.readall()  # read the blob once so it can be parsed twice if needed
try:
    data_list = pd.read_parquet(io.BytesIO(raw), columns=columns, engine='fastparquet')
except ValueError as e:
    print(f"Error reading Parquet file: {e}")
    # Fall back to reading every column, then keep only the requested ones that exist
    data_list = pd.read_parquet(io.BytesIO(raw), engine='fastparquet')
    available = [column for column in columns if column in data_list.columns]
    for column in set(columns) - set(available):
        print(f"Skipping missing column: {column}")
    data_list = data_list[available]
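Alternatively, you could inspect the file's schema first and only ever request columns that are actually present, so that no exception is raised at all. A minimal sketch, assuming blob_data and columns are the objects from the question and that your fastparquet version accepts an open file-like object:

import io
import fastparquet

raw = blob_data.readall()  # blob_data as in the question
pf = fastparquet.ParquetFile(io.BytesIO(raw))  # parses only the Parquet footer/metadata
present = [column for column in columns if column in pf.columns]  # intersect the request with the real schema
data_list = pf.to_pandas(columns=present)

This trades the try/except for one extra metadata parse and keeps the read path exception-free.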
Upvotes: 2