Mathieu Dubois
Mathieu Dubois

Reputation: 924

How to store the name of rows and column index in pandas DataFrame?

I have a DataFrame with named rows and columns indexes:

import numpy as np
import pandas as pd

I = pd.Index(["a", "b", "c", "d"], name="rows")
C = pd.Index(["col0", "col1", "col2"], name="cols")
df = pd.DataFrame(data=np.random.rand(4, 3),
                  index=I,
                  columns=C)

I have tried to store it in several formats (Excel, CSV) but when re-reading the file, the names are lost (maybe I have missed some options). Msgpack works but it is marked as experimental so I would prefer to avoid it for now. I would also prefer to avoid pickle. Is there any way (format and option) to store the name of the 2 indexes?

EDIT: I know how to write and read CSV with pandas. The problem is to save the name of the column index and of the row index.

Upvotes: 5

Views: 19305

Answers (4)

Stop harming Monica
Stop harming Monica

Reputation: 12620

You can use hdf.

import numpy as np
import pandas as pd
I = pd.Index(["a", "b", "c", "d"], name="rows")
C = pd.Index(["col0", "col1", "col2"], name="columns")
df = pd.DataFrame(data=np.random.rand(4,3), index=I, columns=C)
print(df)

columns      col0      col1      col2
rows                                 
a        0.098497  0.918954  0.642800
b        0.168266  0.678434  0.455059
c        0.434939  0.244027  0.599400
d        0.877356  0.053085  0.182661

df.to_hdf('test.hdf', 'test')
print(pd.read_hdf('test.hdf'))

columns      col0      col1      col2
rows                                 
a        0.098497  0.918954  0.642800
b        0.168266  0.678434  0.455059
c        0.434939  0.244027  0.599400
d        0.877356  0.053085  0.182661

Upvotes: 8

Borja
Borja

Reputation: 1471

Does your DataFrame here actually have the index name? On my computer I need to write

I = pd.Index(["a", "b", "c", "d"], name="rows")

instead of

I = pd.Index(["a", "b", "c", "d"], "rows")

for the name to be actually assigned. Then using df.to_csv() and pd.read_csv() the index name does get preserved (in the csv the index name is just saved like another column).

If that doesn't work, do you need to be able to open it with programs other than Python? If not, you could use pickle which should leave the object untouched:

import pickle

pickle.dump(df, open("File.pickled", "wb"))

df_read = pickle.load(open("File.pickled", "rb"))

EDIT: If you want the name of the columns index to be preserved with to_csv() as well you could do the following hack:

Save it with (converts the columns index to a row, hence setting its column index name as a row index value)

df.T.reset_index().T.to_csv("DataFrame.csv")

And then read it with either of these two, first one being (very) slightly faster:

df_read = pd.read_csv("DataFrame.csv", index_col=0)
df_read.columns = df_read.loc['cols']
df_read = df_read.drop('cols', axis=0)

or

df_read = pd.read_csv("DataFrame.csv", index_col=0)
df_read = df_read.T.set_index('cols').T

Upvotes: 1

albert
albert

Reputation: 8613

You can export the DataFrame to a csv-file using .to_csv() and read it back in using .read_csv(). I extended the code you already had as follows:

#!/usr/bin/env python3
# coding: utf-8

import numpy as np
import pandas as pd

I = pd.Index(["a", "b", "c", "d"], "rows")
C = pd.Index(["col0", "col1", "col2"], "cols")
df = pd.DataFrame(data=np.random.rand(4,3), index=I, columns=C)

# export DataFrame to csv
df.to_csv('out.csv')

# set index_col in order to read first column as indices
df_in = pd.read_csv('out.csv', index_col=0)

So the DataFrame df looks like this:

       col0      col1      col2
a  0.590016  0.834033  0.535310
b  0.421589  0.897302  0.029500
c  0.373580  0.109005  0.239181
d  0.473872  0.075918  0.751628

The csv-file out.csv looks like this:

,col0,col1,col2
a,0.5900160748408918,0.8340332218911729,0.5353103406507513
b,0.42158899389955884,0.8973015040807538,0.029500416731096046
c,0.37357951184145965,0.10900495955642386,0.2391805787788026
d,0.47387186813644167,0.07591794371425187,0.7516279365972057

Reading the data back in leads to the DataFrame df_in as follows:

       col0      col1      col2
a  0.590016  0.834033  0.535310
b  0.421589  0.897302  0.029500
c  0.373580  0.109005  0.239181
d  0.473872  0.075918  0.751628

So df2 is exactly the same as df which shows that export and the desired import is working as expected.

EDIT to export column and index names:

df.to_csv('out.csv', index_label=[df.index.name, df.columns.name])

However, this makes re-importing a bit difficult since the columns name is added as a additional column. Normally, this is useful for multi-indexed data, but leads to an additional empty column here.

So I would suggest to export the index name, only:

# export DataFrame to csv
df.to_csv('out.csv', index_label=df.index.name)

# set index_col in order to read first column as indices
df_in = pd.read_csv('out.csv', index_col=0)

which leads to df_in as:

          col0      col1      col2
rows                              
a     0.442467  0.959260  0.626502
b     0.639044  0.989795  0.853002
c     0.576137  0.350260  0.532920
d     0.235698  0.095978  0.194151

I do not know why you need to export the names of both index and colums. If you simply want to access the row or column names you can get their label like this:

column_labels = df.columns.get_values()
>>> array(['col0', 'col1', 'col2'], dtype=object)

index_labels = df.index.get_values()
>>> array(['a', 'b', 'c', 'd'], dtype=object)

Upvotes: 2

ehhh
ehhh

Reputation: 36

header=True (for the column names) and index=True (for the index values) in df.to_csv should do the trick. Both are True by default, say the docs. However, you will have to manually set the index after loading, it is being loaded as a regular column. For this, I have added a name to the index:

import numpy as np
import pandas as pd

# your code:
I = pd.Index(["a", "b", "c", "d"], "rows")
C = pd.Index(["col0", "col1", "col2"], "cols")
df = pd.DataFrame(data=np.random.rand(4, 3),
                  index=I,
                  columns=C)

# name the index
I.name = 'index'

print "original:"
print df
print ""

df.to_csv("~/test.csv", index=True, header=True)

df2 = pd.read_csv("~/test.csv").set_index('index')

print "read from disk:"
print df2
print ""

Output:

original:
           col0      col1      col2
index                              
a      0.455378  0.830872  0.495953
b      0.707146  0.832009  0.112540
c      0.894997  0.156364  0.521047
d      0.775462  0.482554  0.578177

read from disk:
           col0      col1      col2
index                              
a      0.455378  0.830872  0.495953
b      0.707146  0.832009  0.112540
c      0.894997  0.156364  0.521047
d      0.775462  0.482554  0.578177

Upvotes: 0

Related Questions