Below is a snippet that converts data into a NumPy array. It is then converted to a Pandas DataFrame where I intend to process it. I'm attempting to convert it back to a NumPy array. I'm failing at this. Badly.
import pandas as pd
import numpy as np
from pprint import pprint
data = [
('2020-11-01 00:00:00', 1.0),
('2020-11-02 00:00:00', 2.0)
]
coordinatesType = [('timestamp', 'datetime64[s]'), ('value', '<f8')]
npArray = np.asarray(data, coordinatesType)
df = pd.DataFrame(data = npArray)
# do some pandas processing, then convert back to a numpy array
mutatedNpArray = df.to_numpy(coordinatesType)
pprint(mutatedNpArray)
# don't supply dtype for kicks
pprint(df.to_numpy())
This yields crazytown:
array([[('2020-11-01T00:00:00', 1.6041888e+18),
('1970-01-01T00:00:01', 1.0000000e+00)],
[('2020-11-02T00:00:00', 1.6042752e+18),
('1970-01-01T00:00:02', 2.0000000e+00)]],
dtype=[('timestamp', '<M8[s]'), ('value', '<f8')])
array([[Timestamp('2020-11-01 00:00:00'), 1.0],
[Timestamp('2020-11-02 00:00:00'), 2.0]], dtype=object)
I realize a DataFrame is really a fancy NumPy array under the hood, but I'm passing back to a function that accepts a simple NumPy array. Clearly I'm not handling dtypes correctly and/or I don't understand the data structure inside my DataFrame. Below is what the function I'm calling expects:
array([('2020-11-01T00:00:00', 1.000),
('2020-11-02T00:00:00', 2.000)],
dtype=[('timestamp', '<M8[s]'), ('value', '<f8')])
I'm really lost on how to do this. Or what I should be doing instead.
Help!
As @hpaul suggested, I tried the following:
# ...
df = df.set_index('timestamp')
# do some pandas processing, then convert back to a numpy array
mutatedNpArray = df.to_records(coordinatesType)
# ...
All good!
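One caveat on the snippet above: the first positional parameter of DataFrame.to_records is index, not a dtype, so coordinatesType is only read as a truthy flag there and the field types come from the DataFrame itself. Below is a minimal sketch of one way to get the exact dtype back, assuming the unindexed df from the question. The names records and structured are mine, and the astype step is my addition rather than something from the comments.

```python
import numpy as np
import pandas as pd

data = [
    ('2020-11-01 00:00:00', 1.0),
    ('2020-11-02 00:00:00', 2.0),
]
coordinatesType = [('timestamp', 'datetime64[s]'), ('value', '<f8')]
df = pd.DataFrame(np.asarray(data, coordinatesType))

# index=False keeps the RangeIndex out of the record array, so the fields
# are exactly the DataFrame columns: timestamp and value.
records = df.to_records(index=False)

# pandas may hand the timestamps back at a different resolution (for
# example nanoseconds), so cast explicitly to the dtype the consumer expects.
structured = records.astype(coordinatesType)

print(structured.shape)         # (2,)
print(structured['timestamp'])  # one datetime64[s] value per row
print(structured['value'])      # one float64 value per row
```

The result is still a np.recarray, which is an ndarray subclass, so it should be accepted anywhere a structured array is.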
Upvotes: 1
Reputation: 4929
Besides the to_records approach mentioned in the comments, you can do:
df.apply(tuple, axis=1).to_numpy(coordinatesType)
Output:
array([('2020-11-01T00:00:00', 1.), ('2020-11-02T00:00:00', 2.)],
dtype=[('timestamp', '<M8[s]'), ('value', '<f8')])
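If the row-wise apply gets slow on a large frame, a different technique is to allocate the structured array first and fill it one column at a time. This is only a sketch under the question's setup. The helper name frame_to_structured is hypothetical, and it assumes every field name in the dtype is also a column in the DataFrame.

```python
import numpy as np
import pandas as pd

data = [
    ('2020-11-01 00:00:00', 1.0),
    ('2020-11-02 00:00:00', 2.0),
]
coordinatesType = [('timestamp', 'datetime64[s]'), ('value', '<f8')]
df = pd.DataFrame(np.asarray(data, coordinatesType))


def frame_to_structured(frame, dtype):
    """Hypothetical helper: copy DataFrame columns into a 1-D structured array."""
    out = np.empty(len(frame), dtype=dtype)
    for name in out.dtype.names:
        # Assigning into a field casts the column to that field's type,
        # e.g. whatever datetime resolution pandas uses down to seconds.
        out[name] = frame[name].to_numpy()
    return out


result = frame_to_structured(df, coordinatesType)
print(result.shape)  # (2,)
print(result.dtype)  # [('timestamp', '<M8[s]'), ('value', '<f8')]
```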
Considerations:
I believe the issue comes from the difference in shape between the original array and the DataFrame. Your original NumPy array has shape (2,), where each element is a tuple-like record. Once the DataFrame is created, both df.shape and df.to_numpy().shape are (2, 2), so the structured dtype is applied to each of the four cells instead of to each row, which is why every cell in your output turns into its own (timestamp, value) pair. Converting each row to a tuple gives a pd.Series with the original shape of (2,).
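The shape difference described above is easy to check directly. A small sketch reusing the question's data (the variable name rows is mine):

```python
import numpy as np
import pandas as pd

data = [
    ('2020-11-01 00:00:00', 1.0),
    ('2020-11-02 00:00:00', 2.0),
]
coordinatesType = [('timestamp', 'datetime64[s]'), ('value', '<f8')]
npArray = np.asarray(data, coordinatesType)
df = pd.DataFrame(npArray)

# The structured array is 1-D: one record per row.
print(npArray.shape)        # (2,)

# The DataFrame exports a 2-D grid: one cell per (row, column) pair.
print(df.to_numpy().shape)  # (2, 2)

# Packing each row into a tuple restores the 1-D layout.
rows = df.apply(tuple, axis=1)
print(rows.shape)           # (2,)
```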
Upvotes: 1