Doug E Fresh

Reputation: 291

parquet int96 timestamp conversion to datetime/date via python

TL;DR

I'd like to convert an int96 value such as ACIE4NxJAAAKhSUA into a readable timestamp format like 2020-03-02 14:34:22 (or anything similarly human-readable). I mostly use Python, so I'm looking to build a function that does this conversion. If there's another function that can do the reverse -- even better.

Background

I'm using parquet-tools to convert a raw parquet file (with snappy compression) to raw JSON via this command:

C:\Research> java -jar parquet-tools-1.8.2.jar cat --json original-file.snappy.parquet > parquet-output.json

Inside the JSON, I'm seeing these values as the timestamps:

{... "_id":"101836","timestamp":"ACIE4NxJAAAKhSUA"}

I've determined that the timestamp value of "ACIE4NxJAAAKhSUA" is really an int96 (this is also confirmed by reading the schema of the parquet file):

message spark_schema {
 ...(stuff)...
  optional binary _id (UTF8);
  optional int96 timestamp;
}

I think this is also known as an Impala timestamp (at least that's what I've gathered).

Further Issue Research

I've been searching everywhere for some function or info regarding how to "read" the int96 value (into python -- I'd like to keep it in that language since I'm most familiar with it) and output the timestamp -- I've found nothing.

Here are a few articles I've already looked into that relate to this subject:

Regarding the deprecated int96 timestamp

Please don't ask me to stop using an old/deprecated timestamp format within a parquet file; I'm well aware of that from the research I've done so far. I'm a receiver of the file/data -- I can't change the format used on creation.

If there's another way to control the initial JSON output to deliver a "non int96" value -- I'd be interested in that as well.

Thanks so much for your help SO community!

Upvotes: 6

Views: 11084

Answers (2)

Stephen Simmons

Reputation: 8151

Doug, this code from arrow/cpp/src/parquet/types.h shows how the Int96 timestamps are stored internally:

constexpr int64_t kJulianToUnixEpochDays = INT64_C(2440588);
constexpr int64_t kSecondsPerDay = INT64_C(60 * 60 * 24);
constexpr int64_t kMillisecondsPerDay = kSecondsPerDay * INT64_C(1000);
constexpr int64_t kMicrosecondsPerDay = kMillisecondsPerDay * INT64_C(1000);
constexpr int64_t kNanosecondsPerDay = kMicrosecondsPerDay * INT64_C(1000);

MANUALLY_ALIGNED_STRUCT(1) Int96 { uint32_t value[3]; };
STRUCT_END(Int96, 12);

static inline void Int96SetNanoSeconds(parquet::Int96& i96, int64_t nanoseconds) {
  std::memcpy(&i96.value, &nanoseconds, sizeof(nanoseconds));
}

static inline int64_t Int96GetNanoSeconds(const parquet::Int96& i96) {
  // We do the computations in the unsigned domain to avoid unsigned behaviour
  // on overflow.
  uint64_t days_since_epoch =
      i96.value[2] - static_cast<uint64_t>(kJulianToUnixEpochDays);
  uint64_t nanoseconds = 0;

  memcpy(&nanoseconds, &i96.value, sizeof(uint64_t));
  return static_cast<int64_t>(days_since_epoch * kNanosecondsPerDay + nanoseconds);
}
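
In Python, that layout translates directly: the 12 bytes decode little-endian, with the first 8 holding the nanoseconds within the day and the last 4 holding the Julian day number, exactly as in the C++ above. Here's a minimal sketch (the names int96_to_datetime / datetime_to_int96 are my own, and it assumes the base64 string in the parquet-tools JSON output is the raw 12-byte value):

import base64
import struct
from datetime import datetime, timedelta, timezone

JULIAN_TO_UNIX_EPOCH_DAYS = 2440588  # Julian Day Number of 1970-01-01

def int96_to_datetime(b64_value):
    """Decode a base64-encoded INT96 timestamp into a UTC datetime."""
    raw = base64.b64decode(b64_value)             # 12 bytes, little-endian
    nanoseconds, julian_day = struct.unpack("<qI", raw)
    days = julian_day - JULIAN_TO_UNIX_EPOCH_DAYS
    return (datetime(1970, 1, 1, tzinfo=timezone.utc)
            + timedelta(days=days, microseconds=nanoseconds // 1000))

def datetime_to_int96(dt):
    """The reverse: encode an aware UTC datetime as a base64 INT96 string."""
    delta = dt - datetime(1970, 1, 1, tzinfo=timezone.utc)
    julian_day = delta.days + JULIAN_TO_UNIX_EPOCH_DAYS
    nanoseconds = delta.seconds * 10**9 + delta.microseconds * 1000
    return base64.b64encode(struct.pack("<qI", nanoseconds, julian_day)).decode("ascii")

print(int96_to_datetime("ACIE4NxJAAAKhSUA"))  # 2020-02-10 22:33:33+00:00

Note that the nanosecond precision gets truncated to microseconds, since Python's datetime can't represent nanoseconds.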

Upvotes: 1

AngryPanda

Reputation: 1281

parquet-tools will not be able to change the format type from INT96 to INT64. What you are observing in the JSON output is a string representation of the timestamp stored as an INT96 TimestampType. You will need Spark to rewrite this parquet file with the timestamp as an INT64 TimestampType; then the JSON output will produce a timestamp in the format you desire.

You will need to set a specific config in Spark -

spark-shell --conf spark.sql.parquet.outputTimestampType=TIMESTAMP_MICROS

2020-03-16 11:37:50 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://192.168.0.20:4040
Spark context available as 'sc' (master = local[*], app id = local-1584383875924).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.0
      /_/
Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_91)
Type in expressions to have them evaluated.
Type :help for more information.

scala> val sourceDf = spark.read.parquet("original-file.snappy.parquet")
2020-03-16 11:38:31 WARN  Utils:66 - Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.debug.maxToStringFields' in SparkEnv.conf.
sourceDf: org.apache.spark.sql.DataFrame = [application: struct<name: string, upgrades: struct<value: double> ... 3 more fields>, timestamp: timestamp ... 16 more fields]

scala> sourceDf.repartition(1).write.parquet("Downloads/output")
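
Since the question prefers Python, the same rewrite can also be driven from PySpark instead of spark-shell. A sketch, assuming Spark 2.3+ (where the spark.sql.parquet.outputTimestampType option was introduced):

from pyspark.sql import SparkSession

# Write timestamps as INT64 TIMESTAMP_MICROS instead of the deprecated INT96.
spark = (SparkSession.builder
         .config("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MICROS")
         .getOrCreate())

source_df = spark.read.parquet("original-file.snappy.parquet")
source_df.repartition(1).write.parquet("Downloads/output")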

Parquet-tools will show the correct TimestampType

parquet-tools schema Downloads/output/part-00000-edba239b-e696-4b4e-8fd3-c7cca9eea6bf-c000.snappy.parquet 

message spark_schema {
  ...
  optional binary _id (UTF8);
  optional int64 timestamp (TIMESTAMP_MICROS);
  ...
}

And the JSON dump gives -

parquet-tools cat --json Downloads/output/part-00000-edba239b-e696-4b4e-8fd3-c7cca9eea6bf-c000.snappy.parquet

{..."_id":"101836", "timestamp":1583973827000000}

The timestamp recorded is in microseconds (per the TIMESTAMP_MICROS type). Hope this helps!
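
Once the value is a plain INT64 count of microseconds since the Unix epoch, converting it in Python is a one-liner (assuming the value is UTC):

from datetime import datetime, timezone

micros = 1583973827000000  # the value from the JSON dump above
print(datetime.fromtimestamp(micros / 1_000_000, tz=timezone.utc))
# 2020-03-12 00:43:47+00:00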

Upvotes: 6
