Reputation: 414
Environment
Collecting raw data with Python 3.7 -> boto3 -> S3.
Stored as Parquet (serialized with fastparquet).
Using a Glue Crawler with no customization (just created it, assigned an IAM role, set the S3 target, and created an empty Data Catalog).
Problem
The following Athena queries return data only up to 2020-04-01:
SELECT * FROM "rawdata" where "partition_0" >= '2020-03-29' and "partition_0" <= '2020-04-02';
SELECT * FROM "rawdata" where ("partition_0" = '2020-03-29' or "partition_0" = '2020-03-30' or "partition_0" = '2020-03-31' or "partition_0" = '2020-04-01' or "partition_0" = '2020-04-02')
#These two queries mean the same thing and return the same result.
But if I query Athena as follows at the same time, it does return 2020-04-02:
SELECT * FROM "rawdata" where "partition_0" >= '2020-04-02' and "partition_0" <= '2020-04-02';
SELECT * FROM "rawdata" where "partition_0" = '2020-04-02';
#These two queries also mean the same thing and return the same result.
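For context, a quick way I can check whether the '2020-04-02' partition is already registered in the Data Catalog at the moment a range query misses it (a boto3 sketch; the database name "collectorname" is only an assumption for illustration):
import boto3

glue = boto3.client("glue")
# Database/table names are assumptions -- use the ones the crawler created.
resp = glue.get_partitions(DatabaseName="collectorname", TableName="rawdata")
for part in resp["Partitions"]:
    print(part["Values"])  # e.g. ['2020-04-01'] -- is ['2020-04-02'] listed yet?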
Structure
The S3 partitions follow this format:
bucketname/collectorname/merged/rawdata/yyyy-mm-dd/data.parquet
Each Glue Crawler has its own Data Catalog, whose name is the same as the crawler's.
The Glue Crawler's target is:
bucketname/collectorname/merged/rawdata
Every Glue Crawler uses the same IAM role, which has two AWS managed policies:
AWSGlueServiceRole
AmazonS3ReadOnlyAccess
The Parquet file is saved with pandas.to_parquet using fastparquet, uncompressed.
I didn't edit any script in the Glue Crawler.
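For reference, the write that produces each merged data.parquet looks roughly like this (a sketch, not my exact code; the date in the key is just an example):
import boto3
import pandas as pd

df = pd.DataFrame({"price": [100.0, 101.5], "volume": [10, 12]})  # hypothetical rows

# Serialize with fastparquet, uncompressed, to a local file...
df.to_parquet("/tmp/data.parquet", engine="fastparquet", compression=None)

# ...then upload it under the partitioned prefix that the crawler targets.
boto3.client("s3").upload_file(
    "/tmp/data.parquet",
    "bucketname",
    "collectorname/merged/rawdata/2020-04-02/data.parquet",
)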
Workflow
A collector collects 3000 rows every 3 minutes (Mon to Fri, 09:00 to 15:30).
It saves those 3000 rows as a Parquet file to the following paths (a write sketch follows below):
#always new data (written every 3 minutes, with keys distinguished down to the second)
bucketname/collectorname/notmerged/rawdata/yyyy-mm-dd/hh_mm_ss.parquet.bz2
#always overwritten; if someone requests the latest snapshot, the system uses this file directly (not Athena)
bucketname/collectorname/cached/latest/data.parquet
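The collector's write step looks roughly like this sketch (hypothetical dataframe contents; only the notmerged path is shown):
import bz2
import datetime

import boto3
import pandas as pd

df = pd.DataFrame({"price": [100.0], "volume": [10]})  # hypothetical 3-minute batch

# Serialize with fastparquet, then bz2-compress the file contents.
df.to_parquet("/tmp/batch.parquet", engine="fastparquet", compression=None)
with open("/tmp/batch.parquet", "rb") as f:
    payload = bz2.compress(f.read())

# The key is separated down to the second, so every write is a new object.
now = datetime.datetime.utcnow()
boto3.client("s3").put_object(
    Bucket="bucketname",
    Key=f"collectorname/notmerged/rawdata/{now:%Y-%m-%d}/{now:%H_%M_%S}.parquet.bz2",
    Body=payload,
)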
Then the parser runs.
The parser is also written in Python 3.7; its pseudo-code is as follows:
import datetime
import io

import pandas as pd
import boto3MyCustomWrapped

#download the last collected dataframe (unzip() decompresses the bz2 payload)
dfnew = pd.read_parquet(io.BytesIO(unzip(boto3MyCustomWrapped.s3.get_object(bucket="bucketname", key="collectorname/cached/latest/data.parquet.bz2"))))
#today as yyyy-mm-dd
strftime_of_today_datetime_yyyy_mm_dd = datetime.datetime.utcnow().strftime('%Y-%m-%d')
#previously merged data
dfold = pd.read_parquet(io.BytesIO(boto3MyCustomWrapped.s3.get_object(bucket="bucketname", key=f"collectorname/merged/{strftime_of_today_datetime_yyyy_mm_dd}/data.parquet")))
#some processing is skipped here (e.g. handling the case where dfold does not exist)
dfmerged = pd.concat([dfold, dfnew])
#validate for Athena optimization (e.g. column type cleanup)
dfmerged = validate_and_process_and_sort_and_clean_index(dfmerged)
#upload, overwriting dfold's S3 path (!!the Data Catalog is connected only to this path!!)
boto3MyCustomWrapped.s3.put_object_from_dataframe(dfmerged, bucket="bucketname",
    key=f"collectorname/merged/{strftime_of_today_datetime_yyyy_mm_dd}/data.parquet")
#today's Glue crawl
if datetime.datetime.utcnow().hour == 9 and datetime.datetime.utcnow().minute < 10:
    boto3MyCustomWrapped.glue.try_start_crawler('collectorname')
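Since boto3MyCustomWrapped is just my own wrapper, the underlying plain-boto3 calls are roughly the following (a sketch of the assumed behavior, not the wrapper's actual code):
import bz2
import io

import boto3
import pandas as pd

s3 = boto3.client("s3")
glue = boto3.client("glue")

# s3.get_object -> download the latest snapshot and decompress the bz2 payload
obj = s3.get_object(Bucket="bucketname",
                    Key="collectorname/cached/latest/data.parquet.bz2")
dfnew = pd.read_parquet(io.BytesIO(bz2.decompress(obj["Body"].read())))

# glue.try_start_crawler -> start the crawler, tolerating "already running"
try:
    glue.start_crawler(Name="collectorname")
except glue.exceptions.CrawlerRunningException:
    pass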
Question
How can I get Athena to include today's data when I use a range query that includes today (not only an 'exact today' query)?
The problem is that the Athena result sometimes includes today's data and sometimes doesn't, even though every one of my queries covers 'today'.
I don't know why, or how to fix it.
More Information
Athena returns no errors in any of these situations, and all results have normal columns.
Upvotes: 0
Views: 881
Reputation: 414
I watched this for a long time, and the problem never happened again (I didn't change anything).
In short, I ultimately couldn't figure out why it happened.
I still feel a little unsure about it, but I want to acknowledge Gabip's help with my question.
All I did was wait and watch.
Anyway, thank you for the answer!
Upvotes: 0
Reputation: 9504
In order to query new partitions in Athena, you should add them explicitly:
ALTER TABLE rawdata ADD
PARTITION (partition_0 = '2020-04-02');
or add all the partitions at once using:
MSCK REPAIR TABLE rawdata
I guess that in your case the first query ran before the partition was added (by the Glue Crawler), which is why the data for '2020-04-02' was unavailable.
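If you want to do this programmatically right before each query, you can run the repair statement through the Athena API. A rough sketch (the database name and output location are placeholders you would need to adjust):
import time
import boto3

athena = boto3.client("athena")

# Register any partitions the crawler hasn't added yet.
qid = athena.start_query_execution(
    QueryString="MSCK REPAIR TABLE rawdata",
    QueryExecutionContext={"Database": "collectorname"},  # placeholder database
    ResultConfiguration={"OutputLocation": "s3://bucketname/athena-results/"},  # placeholder
)["QueryExecutionId"]

# Wait for the repair to finish before issuing the SELECT.
while True:
    state = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)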
Upvotes: 0