Reputation: 49
We are setting up new project-level code directories, which will host PySpark, Hive, Sqoop, and shell wrapper scripts for different subprojects. We need to plan the structure of these directories with long-term goals in mind.
Currently I have structure like -
Conf/
Scripts/
- hql
- shell
- pyspark
...
but the above structure gets messy once multiple subprojects start adding code: too many files, too much to manage, and hard to search.
Can someone suggest, based on past experience, an ideal or better way to arrange the code directories?
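For illustration, one common way to scale the layout described above (a sketch only; the subproject names `ingestion` and `reporting` are hypothetical) is to make the subproject the top-level unit and repeat the per-technology folders inside each one, so every team's code stays in its own tree:

```
conf/
  common/
  ingestion/
  reporting/
scripts/
  ingestion/
    hql/
    pyspark/
    shell/
  reporting/
    hql/
    pyspark/
    shell/
lib/            # shared helper modules / jars
```

This inverts the original structure (technology-first) into subproject-first, which tends to keep growth local to one subtree as new teams come on board.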
Upvotes: 0
Views: 57
Reputation: 191710
Given that code is usually submitted from an edge node, I would suggest limiting SSH access to certain users, then dividing HDFS at least into user accounts... HDFS already has a /user directory, so start there.
Hortonworks at least puts common files for Hive in /apps/hive, Spark in /apps/spark, etc. So there is a landing spot for shared libraries.
If you have project-specific files that can't be placed in a single directory and need finer-grained ACLs than user directories, then /projects or just brand-new folders in the root of HDFS should be fine.
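As a minimal sketch of that /projects idea (the project name `etl` and the group names `etl-dev` and `analysts` are hypothetical; ACLs must be enabled on the cluster via `dfs.namenode.acls.enabled`), creating a project area with finer-grained access could look like:

```shell
# Create a dedicated project directory in HDFS
hdfs dfs -mkdir -p /projects/etl

# Give ownership to the project's dev group
hdfs dfs -chown -R hdfs:etl-dev /projects/etl
hdfs dfs -chmod 770 /projects/etl

# Grant a second group read-only access via an ACL,
# with a default ACL so new files inherit it
hdfs dfs -setfacl -m group:analysts:r-x /projects/etl
hdfs dfs -setfacl -m default:group:analysts:r-x /projects/etl

# Verify the resulting permissions
hdfs dfs -getfacl /projects/etl
```

This keeps the project isolated without resorting to Federation, while still allowing cross-team read access where needed.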
The OCD approach to dividing completely isolated projects would be to set up HDFS Federation and namespaces, where you'd have a NameNode for each major initiative within the company.
Upvotes: 1