Griffith Rees

Reputation: 1384

How can I efficiently save a python pandas dataframe in hdf5 and open it as a dataframe in R?

I think the title covers the issue, but to elucidate:

The pandas package has a DataFrame type for holding tabular data in Python. It also has a convenient interface to the HDF5 file format, so DataFrames (and other data) can be saved through a simple dict-like interface (assuming PyTables is installed):

import pandas 
import numpy
d = pandas.HDFStore('data.h5')
d['testdata'] = pandas.DataFrame({'N': numpy.random.randn(5)})
d.close()

So far so good. However, when I try to load that same HDF5 file into R, things aren't so simple:

> library(hdf5)
> hdf5load('data.h5')
NULL
> testdata
$block0_values
         [,1]      [,2]      [,3]       [,4]      [,5]
[1,] 1.498147 0.8843877 -1.081656 0.08717049 -1.302641
attr(,"CLASS")
[1] "ARRAY"
attr(,"VERSION")
[1] "2.3"
attr(,"TITLE")
[1] ""
attr(,"FLAVOR")
[1] "numpy"

$block0_items
[1] "N"
attr(,"CLASS")
[1] "ARRAY"
attr(,"VERSION")
[1] "2.3"
attr(,"TITLE")
[1] ""
attr(,"FLAVOR")
[1] "numpy"
attr(,"kind")
[1] "string"
attr(,"name")
[1] "N."

$axis1
[1] 0 1 2 3 4
attr(,"CLASS")
[1] "ARRAY"
attr(,"VERSION")
[1] "2.3"
attr(,"TITLE")
[1] ""
attr(,"FLAVOR")
[1] "numpy"
attr(,"kind")
[1] "integer"
attr(,"name")
[1] "N."

$axis0
[1] "N"
attr(,"CLASS")
[1] "ARRAY"
attr(,"VERSION")
[1] "2.3"
attr(,"TITLE")
[1] ""
attr(,"FLAVOR")
[1] "numpy"
attr(,"kind")
[1] "string"
attr(,"name")
[1] "N."

attr(,"TITLE")
[1] ""
attr(,"CLASS")
[1] "GROUP"
attr(,"VERSION")
[1] "1.0"
attr(,"ndim")
[1] 2
attr(,"axis0_variety")
[1] "regular"
attr(,"axis1_variety")
[1] "regular"
attr(,"nblocks")
[1] 1
attr(,"block0_items_variety")
[1] "regular"
attr(,"pandas_type")
[1] "frame"

Which brings me to my question: ideally I would be able to save data back and forth between R and pandas. I can presumably write a wrapper to go from pandas to R (though a pandas MultiIndex might make that trickier), but I don't think I could then easily load that data back into pandas. Any suggestions?

Bonus: what I really want to do is use the data.table package in R with a pandas dataframe (the keying approach is suspiciously similar in both packages). Any help on that one greatly appreciated.

Upvotes: 14

Views: 12526

Answers (5)

Rich Signell

Reputation: 16355

How to write a dataframe in HDF5 so it can be read in R is now in the Pandas documentation: http://pandas-docs.github.io/pandas-docs-travis/io.html#external-compatibility
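
As a rough sketch of the approach that page describes (not a copy of it): append the frame in pandas' table format with every column stored as a data column, so each column becomes a named field in one HDF5 table instead of the block layout shown in the question. The helper names `flatten_index` and `export_for_r` are hypothetical, and moving the index into ordinary columns first is my own assumption about what is easiest to read on the R side. Writing requires PyTables.

```python
import pandas as pd


def flatten_index(df: pd.DataFrame) -> pd.DataFrame:
    """Turn index levels (including a MultiIndex) into ordinary columns."""
    return df.reset_index()


def export_for_r(df: pd.DataFrame, path: str, key: str) -> None:
    """Write df to an HDF5 file in pandas' table format (needs PyTables)."""
    with pd.HDFStore(path, mode="w") as store:
        # data_columns=True stores every column as its own field in the table
        store.append(key, flatten_index(df), data_columns=True)
```

Calling `export_for_r(df, "export.h5", "testdata")` should produce a file whose `testdata/table` dataset holds one record per row. How that dataset is then read in R depends on the HDF5 package used there.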

Upvotes: 2

Ben

Reputation: 21625

I recommend using feather, built by Wes and Hadley to solve the problem of transferring data between R and Python efficiently.

Python

import numpy as np
import pandas as pd
import feather as ft

df = pd.DataFrame({'N': np.random.randn(5)})
ft.write_dataframe(df, 'df.feather')

R

library(data.table)
library(feather)

dt <- data.table(read_feather("df.feather"))
dt
           N
1: 0.2777700
2: 1.4083377
3: 1.2940691
4: 0.8221348
5: 1.8552908

Upvotes: 2

Jeff

Reputation: 128928

If you are still looking at this, take a look at this post on Google Groups. It shows how to exchange data between pandas and R via HDF5.

https://groups.google.com/forum/?fromgroups#!topic/pydata/0LR72GN9p6w

Upvotes: 8

Dale

Reputation: 4710

It would make sense to drop down to PyTables and store and retrieve your data there.

Ultimately a DataFrame is a dict of Series, which is what an HDF5 Table is. There are limitations on the translation due to incompatible dtypes, but for numerical data it should be straightforward.

The way pandas stores its HDF5 is better viewed as a binary blob. It has to support all the nuances of a DataFrame, which HDF5 does not support cleanly.

https://github.com/dalejung/trtools/blob/master/trtools/io/pytables.py

That module has some of that kind of pandas/HDF5 munging code.
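
A minimal sketch of the idea, which is not taken from the linked module: convert a numeric DataFrame to a NumPy structured array with one field per column, then hand that array to PyTables as a plain Table. The function names `frame_to_records` and `write_flat_table` are made up for illustration, and the sketch assumes all columns have simple numeric dtypes.

```python
import numpy as np
import pandas as pd


def frame_to_records(df: pd.DataFrame) -> np.ndarray:
    """Convert a numeric DataFrame to a structured array, one field per column."""
    return df.to_records(index=False)


def write_flat_table(df: pd.DataFrame, path: str, name: str) -> None:
    """Store df as a plain PyTables Table under the file root."""
    import tables  # PyTables, imported lazily so the conversion works without it

    with tables.open_file(path, mode="w") as h5:
        # obj= lets PyTables infer the table description from the record array
        h5.create_table("/", name, obj=frame_to_records(df))
```

Going back the other way, `pd.DataFrame.from_records(...)` on the array read from the table should rebuild the frame, minus the original index.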

Upvotes: 3

Paul Hiemstra

Reputation: 60924

You could use CSV files as the common data format. Both R and pandas can easily work with those. You might lose some precision, but whether that is a problem depends on your use case.
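
On the precision point, here is a small sketch of one way to keep float64 values intact on the pandas side, assuming you control both the write and the read: write 17 significant digits and parse with the round-trip float parser. The in-memory buffer stands in for a real file, and this says nothing about how R parses the same file.

```python
import io

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"N": rng.standard_normal(5)})

buf = io.StringIO()
# 17 significant digits are enough to identify any float64 uniquely
df.to_csv(buf, index=False, float_format="%.17g")
buf.seek(0)

# round_trip uses an exact (slower) float parser instead of the fast default
restored = pd.read_csv(buf, float_precision="round_trip")
lossless = bool(np.array_equal(df["N"].to_numpy(), restored["N"].to_numpy()))
```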

Upvotes: -1
