Jyoti Dhiman

Reputation: 568

Why does the index name always appear in a parquet file created with pandas?

I am trying to create a parquet file from a pandas DataFrame. Even though I delete the index name before writing, it still appears when I read the parquet file back. Can anyone help me with this? I want index.name to be None.

>>> df = pd.DataFrame({'key': 1}, index=[0])
>>> df
   key
0    1
>>> df.to_parquet('test.parquet')
>>> df = pd.read_parquet('test.parquet')
>>> df
       key
index     
0        1
>>> del df.index.name
>>> df
   key
0    1
>>> df.to_parquet('test.parquet')
>>> df = pd.read_parquet('test.parquet')
>>> df
       key
index     
0        1

Upvotes: 6

Views: 15879

Answers (3)

Liam Gower

Reputation: 324

This works with pyarrow, using the following:

import pandas as pd

df = pd.DataFrame({'key': 1}, index=[0])
# index=False tells the engine not to store the DataFrame index in the file
df.to_parquet('test.parquet', engine='pyarrow', index=False)
df = pd.read_parquet('test.parquet', engine='pyarrow')
df.head()

As @alexopoulos7 mentioned, the to_parquet documentation states that you can pass an "index" argument. It seems to work, perhaps because I'm explicitly passing engine='pyarrow'.

Upvotes: 5

alexopoulos7

Reputation: 912

I have been experimenting with both pyarrow and fastparquet, trying to write a parquet file without preserving the index, since I need the data to be read from Redshift as an external table.

What worked for me was the fastparquet library:

df.to_parquet(destination_file, engine='fastparquet', compression='gzip', write_index=False)

If you follow the official to_parquet documentation, you will see that it mentions an "index" parameter, but passing it throws an error if the engine in use does not support that argument. So far, I have found that only fastparquet has such an option, and it is named "write_index".

Upvotes: 2

Jyoti Dhiman

Reputation: 568

It works as expected using pyarrow, but not with fastparquet:

>>> df = pd.DataFrame({'key': 1}, index=[0])
>>> df.to_parquet('test.parquet', engine='fastparquet')
>>> df = pd.read_parquet('test.parquet')
>>> del df.index.name
>>> df
   key
0    1
>>> df.to_parquet('test.parquet', engine='fastparquet')
>>> df = pd.read_parquet('test.parquet')
>>> df
       key
index     
0        1 ---> INDEX NAME APPEARS EVEN AFTER DELETING USING fastparquet
>>> del df.index.name
>>> df.to_parquet('test.parquet', engine='pyarrow')
>>> df = pd.read_parquet('test.parquet')
>>> df
   key
0    1 --> INDEX NAME IS NONE WHEN CONVERSION IS DONE WITH pyarrow
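
If you have to stay on an engine that writes the index name, one possible workaround is to clear the name after reading. This is only a sketch: the DataFrame below is a hand-built stand-in for what read_parquet returned, so it runs without any parquet engine.

```python
import pandas as pd

# Stand-in for a frame read back from parquet with an index named "index"
df = pd.DataFrame({"key": [1]}, index=pd.Index([0], name="index"))

# rename_axis(None) returns a new frame whose index has no name;
# assigning df.index.name = None would do the same in place
cleaned = df.rename_axis(None)

print(cleaned.index.name)  # None
```

This only changes the in-memory DataFrame; the name stored in the parquet file itself is untouched.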

Upvotes: 3
