Reputation: 1300
I built my program by following this or this example. Each time I try to insert into BigQuery, I get this error:
OverflowError: date value out of range [while running 'Format']
My Beam Pipeline is this:
Bigquery = (transformation
            | 'Format' >> beam.ParDo(FormatBigQueryoFn())
            | 'Write to BigQuery' >> beam.io.Write(beam.io.BigQuerySink(
                'XXXX',
                schema=TABLE_SCHEMA,
                create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND
            )))
The FormatBigQueryoFn class is where the window date-time logic should go.
The code of example 1:
def timestamp2str(t, fmt='%Y-%m-%d %H:%M:%S.000'):
    """Converts a unix timestamp into a formatted string."""
    return datetime.fromtimestamp(t).strftime(fmt)


class TeamScoresDict(beam.DoFn):
    """Formats the data into a dictionary of BigQuery columns with their values.

    Receives a (team, score) pair, extracts the window start timestamp, and
    formats everything together into a dictionary. The dictionary is in the format
    {'bigquery_column': value}
    """
    def process(self, team_score, window=beam.DoFn.WindowParam):
        team, score = team_score
        start = timestamp2str(int(window.start))
        yield {
            'team': team,
            'total_score': score,
            'window_start': start,
            'processing_time': timestamp2str(int(time.time()))
        }
The code of example 2:
class FormatDoFn(beam.DoFn):
    def process(self, element, window=beam.DoFn.WindowParam):
        ts_format = '%Y-%m-%d %H:%M:%S.%f UTC'
        window_start = window.start.to_utc_datetime().strftime(ts_format)
        window_end = window.end.to_utc_datetime().strftime(ts_format)
        return [{'word': element[0],
                 'count': element[1],
                 'window_start': window_start,
                 'window_end': window_end}]
What could be wrong in my pipeline?
EDIT:
If I print window.start, for example, I get:
Timestamp(-9223372036860)
Upvotes: 2
Views: 1286
Reputation: 1300
The problem was that I was reading the data from a file to test the pipeline before switching to Google Pub/Sub.
Elements read from a file do not carry a timestamp.
Windowing requires every element to have a timestamp.
Pub/Sub attaches this timestamp automatically.
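To see why the missing timestamp ends up as an OverflowError, here is a minimal stdlib-only sketch. It assumes the conversion in the 'Format' step amounts to adding the window start (in seconds) to the Unix epoch; the helper names `seconds_to_str` and `try_format` are made up for illustration and are not part of Beam.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)  # Unix epoch as a naive UTC datetime


def seconds_to_str(seconds, fmt='%Y-%m-%d %H:%M:%S.%f UTC'):
    """Format Unix seconds as a UTC string by offsetting from the epoch."""
    return (EPOCH + timedelta(seconds=seconds)).strftime(fmt)


def try_format(seconds):
    """Return the formatted string, or the error text if the date is out of range."""
    try:
        return seconds_to_str(seconds)
    except OverflowError as exc:
        return 'OverflowError: %s' % exc


print(try_format(60))              # a window start one minute after the epoch
print(try_format(-9223372036860))  # the window.start value printed in the question
```

The second call fails because that value lies hundreds of thousands of years before year 1, which `datetime` cannot represent. In Beam itself, elements read from a file can be given a real timestamp by wrapping them in `beam.window.TimestampedValue(element, unix_seconds)` before the windowing step.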
From documentation:
The simplest form of windowing is using fixed time windows: given a timestamped PCollection which might be continuously updating, each window might capture (for example) all elements with timestamps that fall into a five minute interval.
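The fixed-window idea from that quote can be sketched in plain Python. This is only an illustration of the arithmetic, not Beam's implementation; `fixed_window` and `group_by_window` are hypothetical helpers, and the events are assumed to be (unix_seconds, value) pairs.

```python
from collections import defaultdict


def fixed_window(timestamp, size):
    """Return (start, end) of the size-second window that contains timestamp."""
    start = timestamp - (timestamp % size)
    return start, start + size


def group_by_window(events, size=300):
    """Group (timestamp, value) pairs into fixed windows of `size` seconds."""
    windows = defaultdict(list)
    for timestamp, value in events:
        windows[fixed_window(timestamp, size)].append(value)
    return dict(windows)


# Three events; the first two fall into the same five-minute window.
events = [(0, 'a'), (299, 'b'), (300, 'c')]
print(group_by_window(events))
```

Every step here depends on the element's timestamp, which is why an element without one cannot be placed in a meaningful window.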
Upvotes: 1