Reputation: 924
I have a DataFrame
with named rows and columns indexes:
import numpy as np
import pandas as pd
I = pd.Index(["a", "b", "c", "d"], name="rows")
C = pd.Index(["col0", "col1", "col2"], name="cols")
df = pd.DataFrame(data=np.random.rand(4, 3),
index=I,
columns=C)
I have tried to store it in several formats (Excel, CSV) but when re-reading the file, the names are lost (maybe I have missed some options). Msgpack
works but it is marked as experimental so I would prefer to avoid it for now. I would also prefer to avoid pickle
. Is there any way (format and option) to store the name of the 2 indexes?
EDIT: I know how to write and read CSV with pandas. The problem is to save the name of the column index and of the row index.
Upvotes: 5
Views: 19305
Reputation: 12620
You can use hdf.
import numpy as np
import pandas as pd
I = pd.Index(["a", "b", "c", "d"], name="rows")
C = pd.Index(["col0", "col1", "col2"], name="columns")
df = pd.DataFrame(data=np.random.rand(4,3), index=I, columns=C)
print(df)
columns col0 col1 col2
rows
a 0.098497 0.918954 0.642800
b 0.168266 0.678434 0.455059
c 0.434939 0.244027 0.599400
d 0.877356 0.053085 0.182661
df.to_hdf('test.hdf', 'test')
print(pd.read_hdf('test.hdf'))
columns col0 col1 col2
rows
a 0.098497 0.918954 0.642800
b 0.168266 0.678434 0.455059
c 0.434939 0.244027 0.599400
d 0.877356 0.053085 0.182661
Upvotes: 8
Reputation: 1471
Does your DataFrame here actually have the index name? On my computer I need to write
I = pd.Index(["a", "b", "c", "d"], name="rows")
instead of
I = pd.Index(["a", "b", "c", "d"], "rows")
for the name to be actually assigned. Then using df.to_csv() and pd.read_csv() the index name does get preserved (in the csv the index name is just saved like another column).
If that doesn't work, do you need to be able to open it with programs other than Python? If not, you could use pickle which should leave the object untouched:
import pickle
pickle.dump(df, open("File.pickled", "wb"))
df_read = pickle.load(open("File.pickled", "rb"))
EDIT: If you want the name of the columns index to be preserved with to_csv() as well you could do the following hack:
Save it with (converts the columns index to a row, hence setting its column index name as a row index value)
df.T.reset_index().T.to_csv("DataFrame.csv")
And then read it with either of these two, first one being (very) slightly faster:
df_read = pd.read_csv("DataFrame.csv", index_col=0)
df_read.columns = df_read.loc['cols']
df_read = df_read.drop('cols', axis=0)
or
df_read = pd.read_csv("DataFrame.csv", index_col=0)
df_read = df_read.T.set_index('cols').T
Upvotes: 1
Reputation: 8613
You can export the DataFrame to a csv-file using .to_csv()
and read it back in using .read_csv()
. I extended the code you already had as follows:
#!/usr/bin/env python3
# coding: utf-8
import numpy as np
import pandas as pd
I = pd.Index(["a", "b", "c", "d"], "rows")
C = pd.Index(["col0", "col1", "col2"], "cols")
df = pd.DataFrame(data=np.random.rand(4,3), index=I, columns=C)
# export DataFrame to csv
df.to_csv('out.csv')
# set index_col in order to read first column as indices
df_in = pd.read_csv('out.csv', index_col=0)
So the DataFrame df
looks like this:
col0 col1 col2
a 0.590016 0.834033 0.535310
b 0.421589 0.897302 0.029500
c 0.373580 0.109005 0.239181
d 0.473872 0.075918 0.751628
The csv-file out.csv
looks like this:
,col0,col1,col2
a,0.5900160748408918,0.8340332218911729,0.5353103406507513
b,0.42158899389955884,0.8973015040807538,0.029500416731096046
c,0.37357951184145965,0.10900495955642386,0.2391805787788026
d,0.47387186813644167,0.07591794371425187,0.7516279365972057
Reading the data back in leads to the DataFrame df_in
as follows:
col0 col1 col2
a 0.590016 0.834033 0.535310
b 0.421589 0.897302 0.029500
c 0.373580 0.109005 0.239181
d 0.473872 0.075918 0.751628
So df2
is exactly the same as df
which shows that export and the desired import is working as expected.
EDIT to export column and index names:
df.to_csv('out.csv', index_label=[df.index.name, df.columns.name])
However, this makes re-importing a bit difficult since the columns name is added as a additional column. Normally, this is useful for multi-indexed data, but leads to an additional empty column here.
So I would suggest to export the index name, only:
# export DataFrame to csv
df.to_csv('out.csv', index_label=df.index.name)
# set index_col in order to read first column as indices
df_in = pd.read_csv('out.csv', index_col=0)
which leads to df_in
as:
col0 col1 col2
rows
a 0.442467 0.959260 0.626502
b 0.639044 0.989795 0.853002
c 0.576137 0.350260 0.532920
d 0.235698 0.095978 0.194151
I do not know why you need to export the names of both index and colums. If you simply want to access the row or column names you can get their label like this:
column_labels = df.columns.get_values()
>>> array(['col0', 'col1', 'col2'], dtype=object)
index_labels = df.index.get_values()
>>> array(['a', 'b', 'c', 'd'], dtype=object)
Upvotes: 2
Reputation: 36
header=True
(for the column names) and index=True
(for the index values) in df.to_csv
should do the trick. Both are True
by default, say the docs. However, you will have to manually set the index after loading, it is being loaded as a regular column. For this, I have added a name to the index:
import numpy as np
import pandas as pd
# your code:
I = pd.Index(["a", "b", "c", "d"], "rows")
C = pd.Index(["col0", "col1", "col2"], "cols")
df = pd.DataFrame(data=np.random.rand(4, 3),
index=I,
columns=C)
# name the index
I.name = 'index'
print "original:"
print df
print ""
df.to_csv("~/test.csv", index=True, header=True)
df2 = pd.read_csv("~/test.csv").set_index('index')
print "read from disk:"
print df2
print ""
Output:
original:
col0 col1 col2
index
a 0.455378 0.830872 0.495953
b 0.707146 0.832009 0.112540
c 0.894997 0.156364 0.521047
d 0.775462 0.482554 0.578177
read from disk:
col0 col1 col2
index
a 0.455378 0.830872 0.495953
b 0.707146 0.832009 0.112540
c 0.894997 0.156364 0.521047
d 0.775462 0.482554 0.578177
Upvotes: 0