Shan

Reputation: 2221

Apache Drill - Query HDFS and SQL

I'm trying to explore Apache Drill. I'm not a data analyst, just an infra support guy, and I find the documentation on Apache Drill too limited.

I need some details about the custom data storage that can be used with Apache Drill:

  1. Is it possible to query HDFS without Hive using Apache Drill, the same way dfs does?
  2. Is it possible to query traditional RDBMSs like MySQL and Microsoft SQL Server?

Thanks in advance

Update:

My HDFS storage definition gives an error (Invalid JSON mapping):

{  
  "type":"file",
  "enabled":true,
  "connection":"hdfs:///",
  "workspaces":{  
    "root":{  
      "location":"/",
      "writable":true,
      "storageformat":"null"
    }
  }
}

If I replace hdfs:/// with file:///, the definition seems to be accepted.

I copied all the library files from the folder

<drill-path>/jars/3rdparty to <drill-path>/jars/

I still cannot make it work. Please help. I'm not a dev at all, I'm an infra guy.

Thanks in advance

Upvotes: 1

Views: 1655

Answers (2)

Jinith

Reputation: 438

Yes, Drill can communicate with both Hadoop and RDBMS systems. In fact, you can write queries that join data from both.

The HDFS storage plugin can be defined as:

{
  "type": "file",
  "enabled": true,
  "connection": "hdfs://xxx.xxx.xxx.xxx:8020/",
  "workspaces": {
    "root": {
      "location": "/user/cloudera",
      "writable": true,
      "defaultInputFormat": null
    },
    "tmp": {
      "location": "/tmp",
      "writable": true,
      "defaultInputFormat": null
    }
  },
  "formats": {
    "parquet": {
      "type": "parquet"
    },
    "psv": {
      "type": "text",
      "extensions": [
        "tbl"
      ],
      "delimiter": "|"
    },
    "csv": {
      "type": "text",
      "extensions": [
        "csv"
      ],
      "delimiter": ","
    },
    "tsv": {
      "type": "text",
      "extensions": [
        "tsv"
      ],
      "delimiter": "\t"
    },
    "json": {
      "type": "json"
    }
  }
}
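If you would rather generate this definition than type it by hand, here is a minimal Python sketch. It only builds the JSON text. The helper name `build_hdfs_plugin` and the host `namenode.example.com` are made up for illustration, and the field names are simply the ones used in the definition above, with a shortened formats section. Note that Python's `None` is written out as a bare JSON `null`, not as the quoted string "null".

```python
import json


def build_hdfs_plugin(namenode_host, port=8020, root="/"):
    """Return a Drill file-storage plugin definition as a dict.

    Hypothetical helper: field names are copied from the definition
    in this answer; host and port are placeholders.
    """
    return {
        "type": "file",
        "enabled": True,
        "connection": f"hdfs://{namenode_host}:{port}/",
        "workspaces": {
            "root": {
                "location": root,
                "writable": True,
                # None becomes JSON null (unquoted) when serialized
                "defaultInputFormat": None,
            }
        },
        "formats": {
            "parquet": {"type": "parquet"},
            "csv": {"type": "text", "extensions": ["csv"], "delimiter": ","},
            "json": {"type": "json"},
        },
    }


if __name__ == "__main__":
    # Placeholder host; replace with your own NameNode address.
    plugin = build_hdfs_plugin("namenode.example.com", root="/user/cloudera")
    print(json.dumps(plugin, indent=2))
```

The printed text can then be pasted into the storage plugin definition.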

The connection URL will be your MapR/Cloudera NameNode URL, which uses port 8020 by default. You should be able to find it in the Hadoop configuration on your system under the key "fs_defaultfs" (the property fs.defaultFS in core-site.xml).
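As a rough illustration of looking that value up, the sketch below reads the `fs.defaultFS` property (falling back to the older `fs.default.name`) from the text of a Hadoop `core-site.xml`. The function name and the sample XML are assumptions for the example. On a real cluster you would read the file from your Hadoop configuration directory instead.

```python
import xml.etree.ElementTree as ET


def default_fs_from_core_site(xml_text):
    """Return the default filesystem URI from core-site.xml text, or None."""
    root = ET.fromstring(xml_text)
    values = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        if name:
            values[name.strip()] = (prop.findtext("value") or "").strip()
    # fs.defaultFS is the current key; fs.default.name is the deprecated one
    return values.get("fs.defaultFS") or values.get("fs.default.name")


# Made-up sample content, only for demonstration.
SAMPLE = """<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://namenode.example.com:8020</value>
  </property>
</configuration>"""

if __name__ == "__main__":
    print(default_fs_from_core_site(SAMPLE))
```

The returned URI is what goes into the "connection" field of the plugin definition.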

Upvotes: 0

Yash Sharma

Reputation: 1704

  1. Yes.

Drill directly recognizes the schema of the file based on its metadata. Refer to this link for more info:

https://cwiki.apache.org/confluence/display/DRILL/Connecting+to+Data+Sources

  2. Not yet.

While there is a MapR driver that lets you achieve this, it is not natively supported in Drill yet. There have been several discussions around this and it might be available soon.
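Going back to the first point, here is a small Python sketch of what a direct query against a file in HDFS could look like. It assumes a Drill version that exposes the REST endpoint `/query.json` on the web port, a storage plugin registered under the hypothetical name `hdfs` with a `root` workspace, and a made-up file path. None of these names come from the question.

```python
import json
import urllib.request


def build_query_payload(plugin, workspace, path, limit=10):
    """Build the JSON body for a Drill SQL query against a file path."""
    # Drill quotes file paths with backticks inside the FROM clause
    sql = f"SELECT * FROM {plugin}.{workspace}.`{path}` LIMIT {limit}"
    return {"queryType": "SQL", "query": sql}


def run_query(drill_url, payload):
    """POST the payload to Drill's REST API and return the parsed response.

    Assumes drill_url points at the Drill web port, e.g. http://host:8047.
    """
    req = urllib.request.Request(
        drill_url.rstrip("/") + "/query.json",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))


if __name__ == "__main__":
    # Plugin name, workspace and file are placeholders for illustration.
    payload = build_query_payload("hdfs", "root", "logs/events.json")
    print(payload["query"])
```

No Hive table is involved here: the query names the file path directly, and Drill works out the columns when it reads the file.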

Upvotes: 1
