Reputation: 26221
I figured I'd share, since I searched that on SO and couldn't quite find what I needed.
I wanted to dump a pd.DataFrame
into a yaml file.
Timestamps
should be shown nicely, not as the default:
date: !!python/object/apply:pandas._libs.tslibs.timestamps.Timestamp
- 1589241600000000000
- null
- null
Also, the output should be correct YaML format, i.e., it should be readable back by yaml.load
. The output should be reasonably concise, i.e. preferring the 'flow' format.
As an example, here is some data:
df = pd.DataFrame([
dict(
date=pd.Timestamp.now().normalize() - pd.Timedelta('1 day'),
x=0,
b='foo',
c=[1,2,3,4],
other_t=pd.Timestamp.now(),
),
dict(
date=pd.Timestamp.now().normalize(),
x=1,
b='bar',
c=list(range(32)),
other_t=pd.Timestamp.now(),
),
]).set_index('date')
Upvotes: 2
Views: 3865
Reputation: 26221
Here is what I came up with. It has some customization of the Dumper
to handle Timestamp
. The output is more legible, and still valid yaml. Upon loading, yaml recognizes the format of a valid datetime (ISO format, I think), and re-creates those as datetime
. In fact, we can read it back into a DataFrame
, where these datetime
s are automatically converted into Timestamp
. After a minor reset of index, we observe that the new df
is identical to the original.
import yaml
from yaml import CDumper
from yaml.representer import SafeRepresenter
import datetime
class TSDumper(CDumper):
pass
def timestamp_representer(dumper, data):
return SafeRepresenter.represent_datetime(dumper, data.to_pydatetime())
TSDumper.add_representer(datetime.datetime, SafeRepresenter.represent_datetime)
TSDumper.add_representer(pd.Timestamp, timestamp_representer)
With this, now we can do:
text = yaml.dump(
df.reset_index().to_dict(orient='records'),
sort_keys=False, width=72, indent=4,
default_flow_style=None, Dumper=TSDumper,
)
print(text)
The output is relatively clean:
- date: 2020-05-12 00:00:00
x: 0
b: foo
c: [1, 2, 3, 4]
other_t: 2020-05-13 02:30:23.422589
- date: 2020-05-13 00:00:00
x: 1
b: bar
c: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,
19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31]
other_t: 2020-05-13 02:30:23.422613
Now, we can load this back:
df2 = pd.DataFrame(yaml.load(text, Loader=yaml.SafeLoader)).set_index('date')
And (drum roll, please):
df2.equals(df)
# True
Upvotes: 4