user1101221
user1101221

Reputation: 414

AWS Athena select query result include data inconstantly

Environment
Collecting rawdata from python3.7 -> boto3 -> S3.
With parquet (serialized with fastparquet)
Using Glue Crawler with non-customize (just create, assign IAM role, s3 target, make empty data-catalog)



Problem
Following athena query returns only until 2020-04-01:

SELECT * FROM "rawdata" where "partition_0" >= '2020-03-29' and "partition_0" <= '2020-04-02';
SELECT * FROM "rawdata" where  ("partition_0" = '2020-03-29' or "partition_0" = '2020-03-30' or "partition_0" = '2020-03-31' or "partition_0" = '2020-04-01' or "partition_0" = '2020-04-02')
#These two queries are same meaning, same result.  

But, If I query to athena fowlling next same time, It returns 2020-04-02.

SELECT * FROM "rawdata" where "partition_0" >= '2020-04-02' and "partition_0" <= '2020-04-02';
SELECT * FROM "rawdata" where "partition_0" = '2020-04-02';
#Also these two queries are same meaning, same result.  



Structure
S3 partitions are folling next format :

bucketname/collectorname/merged/rawdata/yyyy-mm-dd/data.parquet

Glue Crawler has own Data-Catalog that's name is same with Glue Crawler.
And Glue Crawler's target is next

bucketname/collectorname/merged/rawdata

Every Glue Crawler's IAM Role is same, they have two AWS managed policies.

AWSGlueServiceRole
AmazonS3ReadOnlyAccess

The parquet file is saved with pandas.to_parquet with fastparquet, not compressed.
I didn't edit any script in Glue Crawler.



Workflow
A collector collects 3000 rows every 3 mins. (Mon to Fri, 9AM to 15:30PM)
So it merge onto 3000 columned parquet to save to next format

#always next data (because 3 min term, but it is saperated with seconds)
bucketname/collectorname/notmerged/rawdata/yyyy-mm-dd/hh_mm_ss.parquet.bz2

#always overwrited, if someone request latest snapshot, system just use it (not athena)
bucketname/collectorname/cached/latest/data.parquet

And next time, the parser works.
Parser is also made by python3.7 and It's pseudo code are next.

import pandas as pd
import io
import boto3MyCustomWrapped

#download last collected dataframe
dfnew = pd.read_parquet(io.BytesIO(unzip(boto3MyCustomWrapped.s3.get_object(bucket="bucketname",key="collectorname/cached/latest/data.parquet.bz2")))

#today yyyy-mm-dd
strftime_of_today_datetime_yyyy_mm_dd = datetime.datetime.utcnow().strftime('%Y-%m-%d')

#merged data
dfold = pd.read_parquet(io.BytesIO(boto3MyCustomWrapped.s3.get_object(bucket="bucketname",key=f"collectorname/merged/{strftime_of_today_datetime_yyyy_mm_dd}/data.parquet"))

#some process is skipped (like processing if dfold not exist)
dfmerged = pd.concat([dfold, dfnew])

#validate for athena optimize (like column type clean)
dfmerged = validate_and_process_and_sort_and_clean_index(dfmerged)

#upload overwrite onto dfold's s3 path (!!data catalog is connected only this path!!)
boto3MyCustomWrapped.s3.put_object_from_dataframe(bucket="bucketname",
    key=f"collectorname/merged/{strftime_of_today_datetime_yyyy_mm_dd}/data.parquet", dfmerged)

#today's glue crawling
if datetime.datetime.utcnow().hour == 9 and datetime.datetime.utcnow().minute < 10:
    boto3MyCustomWrapped.glue.try_start_crawler('collectorname')



Question
How can I working Athena returns include today with 'including today' query? (not only 'exact today' query)
The problem is Athena result include or not include 'today' data conditional, even every my query contains 'today'.
And I don't know why, and how to fix



More Information

There no error returned Athena all situations, and all returns have normal columns.

Upvotes: 0

Views: 881

Answers (2)

user1101221
user1101221

Reputation: 414

Long time watched, That problem never happen again. (I didn't do anything)
In short finally I can't figured out why it happen.
I feel little bit inscure about that, but I feeling responsibility of Gabip's help for my question.

All thing I did it is

  1. Make new data-catalog that's name without '.' (for example, catalog name is "crawling.siteurl.itemtype" and table name is "rawdata", I made a new table with named "crawlingsiteurlitemtype"."rawdata")
  2. Same glue crawling from same source.
  3. show partitions to "crawlingsiteurlitemtype"."rawdata" working good (and it contains 2020-04-02)
  4. show partitions and msck repair dosen't work with my origin table. ("crawling.siteurl.itemtype"."rawdata")
  5. but after 3 and 4, suddenly result of athena query to my origin table contains '2020-04-02' data.
  6. I digged it today, but I can't find what makes it work, and Today, querying to the original data-catalog still returns good response of '2020-04-06(=today)'
  7. So I give up digging how to know, But feel little bit inscure because Someday, It may be break out again.

Anyway, thank you for answer!

Upvotes: 0

Gabio
Gabio

Reputation: 9504

In order to query new partitions in athena, you should explicitly add them:

ALTER TABLE rawdata ADD
  PARTITION (partition_0 = '2020-04-02');

or by adding all the partitions together using:

msck repair table rawdata

I guess that in your case, the first query ran before the partition was added (by the glue crawler) and that's why the data of '2020-04-02' was unavailable.

Upvotes: 0

Related Questions