Reputation: 81
I've created an HDInsight cluster with Apache Spark using Script Actions, as described in Install and use Spark 1.0 on HDInsight Hadoop clusters:
You can install Spark on any type of cluster in Hadoop on HDInsight using Script Action cluster customization. Script action lets you run scripts to customize a cluster, only when the cluster is being created. For more information, see Customize HDInsight cluster using script action.
I ran a basic Python script (the word count sample) and it worked, but when I run a Python script that uses NumPy, this import error is raised on the nodes:
'No module named numpy'
Why can't I import the package, since NumPy is (supposedly) installed out-of-the-box on an HDInsight cluster? Is there a way to install NumPy on the nodes? HDInsight doesn't give you direct access to them.
Upvotes: 1
Views: 1464
Reputation: 671
You can use a custom script as mentioned in the other answer; however, the command below worked for me on an HBase HDInsight cluster. (It should work on a Hadoop HDInsight cluster as well.)
sudo apt-get install python-numpy
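If you need NumPy on every node rather than just the one you're connected to, the same command can be wrapped in a script action. A minimal sketch, assuming a Linux-based cluster with an Ubuntu image; the script name is hypothetical:

#!/usr/bin/env bash
# install-numpy.sh - hypothetical script action body for a Linux-based cluster.
# It runs on whichever node the script action is applied to, so apply it to
# both head and worker nodes to cover the whole cluster.
sudo apt-get -y update
sudo apt-get -y install python-numpy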
Upvotes: 1
Reputation: 4455
You can use Script Actions to apply custom packages to all the data nodes in an HDInsight cluster. The docs are at http://acom-sandbox.azurewebsites.net/en-us/documentation/articles/hdinsight-hadoop-customize-cluster/
Roughly speaking, what you want to do is create your cluster in PowerShell and include something like:
$config = Add-AzureHDInsightScriptAction -Config $config -Name MyScriptActionName -Uri http://uri.to/scriptaction.ps1 -Parameters MyScriptActionParameter -ClusterRoleCollection HeadNode,DataNode
The script at http://uri.to/scriptaction.ps1 can easily be stored in blob storage, and it is run on the node types you specify. That's the script you would use to install any custom Python (or other) packages.
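For context, here is a minimal sketch of how that line fits into cluster creation with the old Azure Service Management cmdlets. The storage account, container, cluster name, location, and script URI are all placeholders, not values from the question:

# Hypothetical end-to-end cluster creation with a script action attached.
$config = New-AzureHDInsightClusterConfig -ClusterSizeInNodes 4
$config = Set-AzureHDInsightDefaultStorage -Config $config `
    -StorageAccountName "mystorage.blob.core.windows.net" `
    -StorageAccountKey $storageKey `
    -StorageContainerName "mycontainer"
$config = Add-AzureHDInsightScriptAction -Config $config `
    -Name MyScriptActionName `
    -Uri http://uri.to/scriptaction.ps1 `
    -Parameters MyScriptActionParameter `
    -ClusterRoleCollection HeadNode,DataNode
New-AzureHDInsightCluster -Config $config -Name "mycluster" `
    -Location "North Europe" -Credential (Get-Credential)

The script action itself then does the actual package installation on each node when the cluster is provisioned.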
Upvotes: 3