pierre_j

Reputation: 983

How to convert a float to a Parquet TIMESTAMP Logical Type?

Let's say I have a pyarrow table with a column named timestamp containing float64 values. These floats are actually timestamps expressed in seconds. For instance:

import pyarrow as pa
my_table = pa.table({'timestamp': pa.array([1600419000.477,1600419001.027])})

I read about Parquet Logical Types in the documentation. How can I convert these float values to the TIMESTAMP Logical Type? I could not find any documentation on how to do this.

Thank you for your help. Have a good day, Bests,

Upvotes: 0

Views: 1223

Answers (2)

joris

Reputation: 139232

You will need to convert the floats into an actual timestamp type in pyarrow; the column will then automatically be written as a Parquet TIMESTAMP logical type.

Using the pyarrow.compute module, this conversion can also be done in pyarrow (a bit less ergonomic than doing the conversion in pandas, but it avoids a round trip to pandas and back):

>>> import pyarrow as pa
>>> import pyarrow.compute as pc
>>> arr = pa.array([1600419000.477, 1600419001.027])
>>> pc.multiply(arr, pa.scalar(1000.)).cast("int64").cast(pa.timestamp('ms'))
<pyarrow.lib.TimestampArray object at 0x7fe5ec3df588>
[
  2020-09-18 08:50:00.477,
  2020-09-18 08:50:01.027
]

Upvotes: 2

0x26res

Reputation: 13932

I don't think you'll be able to cast directly from floats to timestamps within Arrow.

Arrow stores timestamps as 64-bit integers in a given unit (s, ms, us, ns). In your case, you have to multiply your float seconds by the factor for the unit you want (1000 for ms), convert the result to int64, and then cast it to a timestamp.

Here's an example using pandas:

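import pyarrow as pa

(
    pa.array([1600419000.477, 1600419001.027])
    .to_pandas()
    .mul(1000)            # seconds -> milliseconds
    .round()              # avoid truncating values such as ...476.9999
    .astype('int64')      # explicit 64-bit integer on every platform
    .pipe(pa.Array.from_pandas)
    .cast(pa.timestamp('ms'))
)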

Which gives you:

<pyarrow.lib.TimestampArray object at 0x7fb5025b6a08>
[
  2020-09-18 08:50:00.477,
  2020-09-18 08:50:01.027
]
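To make the multiply-then-convert step concrete without either library, here is a minimal plain-Python sketch of the arithmetic. The helper names seconds_to_millis and millis_to_datetime are hypothetical and exist only for this illustration; the timestamps are assumed to be UTC seconds since the Unix epoch.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def seconds_to_millis(seconds: float) -> int:
    """Scale float seconds to integer milliseconds, rounding to the nearest ms."""
    return round(seconds * 1000)


def millis_to_datetime(millis: int) -> datetime:
    """Interpret integer milliseconds since the Unix epoch as a UTC datetime."""
    return EPOCH + timedelta(milliseconds=millis)


millis = [seconds_to_millis(s) for s in [1600419000.477, 1600419001.027]]
print(millis)
print([millis_to_datetime(m).isoformat() for m in millis])
```

Rounding rather than truncating matters because a scaled float can land just below the intended integer: int(0.29 * 100) gives 28, while round(0.29 * 100) gives 29.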

Upvotes: 1
