giaosudau
giaosudau

Reputation: 2251

How to query data from AmazonS3 using Apache Drill?

After using Spark to saveAsTextFile to S3 then format like Hadoop. Format of file in bucket like that.

enter image description here

Format is YEAR/MONTH/DATE/TIMESTAMP and data is in part-0000 file is json format.

I config drill and point into my bucket name

{
  "type": "file",
  "enabled": true,
  "connection": "s3://com.giaosudau.win-bid",
  "workspaces": {
    "root": {
      "location": "/",
      "writable": false,
      "defaultInputFormat": "json"
    },
    "tmp": {
      "location": "/tmp",
      "writable": true,
      "defaultInputFormat": null
    }
  },
  "formats": {
    "psv": {
      "type": "text",
      "extensions": [
        "tbl"
      ],
      "delimiter": "|"
    },
    "csv": {
      "type": "text",
      "extensions": [
        "csv"
      ],
      "delimiter": ","
    },
    "tsv": {
      "type": "text",
      "extensions": [
        "tsv"
      ],
      "delimiter": "\t"
    },
    "parquet": {
      "type": "parquet"
    },
    "json": {
      "type": "json"
    },
    "avro": {
      "type": "avro"
    }
  }
}

I select the storage is ok. But don't know to select and get data for a date or hour or file.

enter image description here

I also try some thing like to select a file but not success:

enter image description here

File sample here

{"auctionId":"xx","bidRequestString":{"id":"xx","timestamp":"2015-07-29T08:31:00.413Z","isTest":false,"url":"http://www.222.3232.com/","userAgent":"Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.7pre) Gecko/20070815 Firefox/2.0.0.6 Navigator/9.0b3","protocolVersion":"Google Protocol Buffer","exchange":"adx","provider":"Google","location":{"dma":-1,"metro":-1,"timezoneOffsetMinutes":-1},"segments":{"AdxDetectedVerticals":["23:0.571243","355:0.409098","380:0.339647","474:0.415936","540:0.079871"]},"userIds":{"prov":"bGZqaWdsa2dmbmRubWxtamdoaGs","xchg":"bGZqaWdsa2dmbmRubWxtamdoaGs"},"imp":[{"id":"99","banner":{"w":[220,300],"h":[600,450],"id":"99","pos":0,"topframe":0},"pmp":{"ext":{"adgroup_id":"17490739393"}},"formats":["220x600","300x450"],"position":0},{"id":"199","banner":{"w":120,"h":600,"id":"199","pos":0,"topframe":0},"pmp":{"ext":{"adgroup_id":"17490739393"}},"formats":["120x600"],"position":0}],"spots":[{"id":"99","banner":{"w":[220,300],"h":[600,450],"id":"99","pos":0,"topframe":0},"pmp":{"ext":{"adgroup_id":"17490739393"}},"formats":["220x600","300x450"],"position":0},{"id":"199","banner":{"w":120,"h":600,"id":"199","pos":0,"topframe":0},"pmp":{"ext":{"adgroup_id":"17490739393"}},"formats":["120x600"],"position":0}],"site":{"id":"502","publisher":{"id":"502"},"page":"http://www.blog.34343.com/"},"device":{"ua":"Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.7pre) Gecko/20070815 Firefox/2.0.0.6 Navigator/9.0b3","ip":"109.103.101.0","geo":{"zip":" 2600"},"language":"en","ext":{"geo_criteria_id":1000142}},"user":{"id":"bGZqaWdsa2dmbmRubWxtamdoaGs","ext":{"cookie_age_seconds":2080089}}},"bidResponseCreative":{"itemId":"fsknxfwe34235235","campaignId":"332","htmlSnippet":"\u003ciframe frameborder=0 scrolling=no width=\"300\" height=\"250\" src=\"//sv.brand-display.com/adedge/api/bd/serving/simple/frame?aukey=34343\u0026_=%CACHEBUSTER%\u0026winning_price=%WINNING_PRICE%\u0026google_click_url=%CLICK_URL_ESC%\u0026encrypt_value=%ENCRYPT_VALUE%\"\u003e\u003c/iframe\u003e","name":"Expandable Web","formatCode":"billboard","bd":"","status":1,"deleted":false,"destinationUrl":"https://ex.sg","tagging":[1,2,3],"expandingDirection":14,"bdUrl":"","format":{"code":"billboard","name":"Billboard","publisher":"default","type":"standard","width":120,"height":600,"expanded_width":0,"expanded_height":0,"collapsed_width":0,"collapsed_height":0,"aspratio":3.76,"expandable":true,"expand_first":true}},"bidResponseCreativeId":"1","bidResponseCreativeName":"","bidResponseData":{"bids":[{"creative":0,"ext":null,"price":"6071USD/1M","priority":1.0,"spotIndex":0},{"creative":1,"ext":null,"price":"6071USD/1M","priority":1.0,"spotIndex":1}]},"bidResponseMeta":"null","bidWinMeta":"null","biddingAgentName":"iface.http","biddingFullAccount":"23848834:strategy","biddingMainAccount":"1212312","biddingMaxPrice":"6071USD/1M","biddingRequestFormatType":"datacratic","biddingSubAccount":"strategy","impIndex":"1","impressionId":"199","pricePriority":"1.000000","rawWinPrice":"100USD/1M","timestamp":"2015-Jul-29 08:31:00.53899","userIds":{"prov":"123232","xchg":"9849839"},"winPrice":"100USD/1M"}
{"auctionId":"343","bidRequestString":{"id":"344","timestamp":"2015-07-28T08:31:00.413Z","isTest":false,"url":"http://www.xx.xx.com/","userAgent":"Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.7pre) Gecko/20070815 Firefox/2.0.0.6 Navigator/9.0b3","protocolVersion":"Google Protocol Buffer","exchange":"adx","provider":"Google","location":{"dma":-1,"metro":-1,"timezoneOffsetMinutes":-1},"segments":{"AdxDetectedVerticals":["23:0.571243","355:0.409098","380:0.339647","474:0.415936","540:0.079871"]},"userIds":{"prov":"bGZqaWdsa2dmbmRubWxtamdoaGs","xchg":"bGZqaWdsa2dmbmRubWxtamdoaGs"},"imp":[{"id":"99","banner":{"w":[220,300],"h":[600,450],"id":"99","pos":0,"topframe":0},"pmp":{"ext":{"adgroup_id":"17490739393"}},"formats":["220x600","300x450"],"position":0},{"id":"199","banner":{"w":120,"h":600,"id":"199","pos":0,"topframe":0},"pmp":{"ext":{"adgroup_id":"17490739393"}},"formats":["120x600"],"position":0}],"spots":[{"id":"99","banner":{"w":[220,300],"h":[600,450],"id":"99","pos":0,"topframe":0},"pmp":{"ext":{"adgroup_id":"17490739393"}},"formats":["220x600","300x450"],"position":0},{"id":"199","banner":{"w":120,"h":600,"id":"199","pos":0,"topframe":0},"pmp":{"ext":{"adgroup_id":"17490739393"}},"formats":["120x600"],"position":0}],"site":{"id":"502","publisher":{"id":"502"},"page":"http://www.blog.343.com/"},"device":{"ua":"Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.7pre) Gecko/20070815 Firefox/2.0.0.6 Navigator/9.0b3","ip":"109.103.101.0","geo":{"zip":" 2600"},"language":"en","ext":{"geo_criteria_id":1000142}},"user":{"id":"bGZqaWdsa2dmbmRubWxtamdoaGs","ext":{"cookie_age_seconds":2080089}}},"bidResponseCreative":{"itemId":"fsknxfwe34235235","campaignId":"342342342","htmlSnippet":"\u003ciframe frameborder=0 scrolling=no width=\"300\" height=\"250\" src=\"//sv.brand-display.com/adedge/api/bd/serving/simple/frame?aukey=xxx\u0026_=%CACHEBUSTER%\u0026winning_price=%WINNING_PRICE%\u0026google_click_url=%CLICK_URL_ESC%\u0026encrypt_value=%ENCRYPT_VALUE%\"\u003e\u003c/iframe\u003e","name":"Expandable Web","formatCode":"billboard","bd":"","status":1,"deleted":false,"destinationUrl":"https://ex.sg","tagging":[1,2,3],"expandingDirection":14,"bdUrl":"","format":{"code":"billboard","name":"Billboard","publisher":"default","type":"standard","width":120,"height":600,"expanded_width":0,"expanded_height":0,"collapsed_width":0,"collapsed_height":0,"aspratio":3.76,"expandable":true,"expand_first":true}},"bidResponseCreativeId":"1","bidResponseCreativeName":"","bidResponseData":{"bids":[{"creative":0,"ext":null,"price":"6071USD/1M","priority":1.0,"spotIndex":0},{"creative":1,"ext":null,"price":"6071USD/1M","priority":1.0,"spotIndex":1}]},"bidResponseMeta":"null","bidWinMeta":"null","biddingAgentName":"iface.http","biddingFullAccount":"829838828928932:strategy","biddingMainAccount":"3434","biddingMaxPrice":"6071USD/1M","biddingRequestFormatType":"datacratic","biddingSubAccount":"strategy","impIndex":"1","impressionId":"199","pricePriority":"1.000000","rawWinPrice":"100USD/1M","timestamp":"2015-Jul-29 08:31:00.53899","userIds":{"prov":"cc","xchg":"cc"},"winPrice":"100USD/1M"}

How to get data from that?

Upvotes: 2

Views: 1454

Answers (1)

catpaws
catpaws

Reputation: 2283

Here are some examples using your data and the local file system instead of s3. I think your files do not have any extension, but if that is not true, if your files have a .json extension, you can query them like this (use the table alias t to resolve ambiguity):

use dfs.`root`;
+-------+---------------------------------------+
|  ok   |                summary                |
+-------+---------------------------------------+
| true  | Default schema changed to [dfs.root]  |
+-------+---------------------------------------+
1 row selected (0.079 seconds)
select t.auctionID from dfs.`/Users/khahn/drill/apache-drill-1.1.0/part-00000` t;
+------------+
| auctionId  |
+------------+
| xx         |
| 343        |
+------------+
2 rows selected (0.093 seconds)

If your files do not have an extension, modify the storage plugin configuration to set the defaultInputFormat to "json"; otherwise, Drill assumes the file is of the Parquet format by default. For example, I set the defaultInputFormat in my root workspace:

{
  "type": "file",
  "enabled": true,
  "connection": "file:///",
  "workspaces": {
    "root": {
      "location": "/",
      "writable": false,
      "defaultInputFormat": "json"
    },
    "tmp": {
      "location": "/tmp",
      "writable": true,
      "defaultInputFormat": null
    }
  },
. . .

When you query the files having no extension, you need to specify that workspace in which you defined defaultInputFormat, root in this example:

select t.auctionID from dfs.`root`.`/Users/khahn/drill/apache-drill-1.1.0/part-00000` t;
+------------+
| auctionId  |
+------------+
| xx         |
| 343        |
+------------+
2 rows selected (0.095 seconds)

Please see docs about handling type differences: https://drill.apache.org/docs/json-data-model/#handling-type-differences

Upvotes: 2

Related Questions