amutter

Reputation: 81

Why do Python scripts on HDInsight fail with 'No module named numpy'?

I've created an HDInsight cluster with Apache Spark using Script Actions, as described in Install and use Spark 1.0 on HDInsight Hadoop clusters:

You can install Spark on any type of cluster in Hadoop on HDInsight using Script Action cluster customization. Script action lets you run scripts to customize a cluster, only when the cluster is being created. For more information, see Customize HDInsight cluster using script action.

I ran a basic Python script (the word count sample) and it worked, but when I start a Python script that uses NumPy, I get this import error raised on the nodes: 'No module named numpy'.

Why can't I import the package, given that NumPy is supposed to be installed out of the box on an HDInsight cluster? Is there a way to install NumPy on the nodes? HDInsight doesn't allow you any access to the nodes.

Upvotes: 1

Views: 1464

Answers (2)

Lokesh

Reputation: 671

You can use a custom script as mentioned in the other answers; however, the command below worked for me on an HBase HDInsight cluster. (It should work on a Hadoop HDInsight cluster as well.)

sudo apt-get install python-numpy

Upvotes: 1

Simon Elliston Ball

Reputation: 4455

You can use Script Actions to apply custom packages to all the data nodes in an HDInsight cluster. The docs are at http://acom-sandbox.azurewebsites.net/en-us/documentation/articles/hdinsight-hadoop-customize-cluster/

Roughly speaking, what you want to do is create your cluster in PowerShell and include something like:

$config = Add-AzureHDInsightScriptAction -Config $config -Name MyScriptActionName -Uri http://uri.to/scriptaction.ps1 -Parameters MyScriptActionParameter -ClusterRoleCollection HeadNode,DataNode

The script at http://uri.to/scriptaction.ps1 can easily be stored on blob storage, and is run on the node types specified. That's the script you would use to install any custom Python (or other) packages.
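Once the script action has run, it can help to confirm that the interpreter on a node actually sees the package. The following is only a minimal sketch of such a check, assuming a Python 3 interpreter on the node; the helper names `module_available` and `missing_modules` are made up for illustration and are not part of HDInsight or Spark.

```python
import importlib.util


def module_available(name):
    """Return True if the current interpreter can locate module `name`."""
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        # Raised for dotted names whose parent package is missing,
        # or for malformed module names.
        return False


def missing_modules(names):
    """Return the subset of `names` that cannot be located, in order."""
    return [name for name in names if not module_available(name)]


if __name__ == "__main__":
    # Hypothetical list of packages the job depends on.
    required = ["numpy"]
    missing = missing_modules(required)
    if missing:
        print("Missing modules: " + ", ".join(missing))
    else:
        print("All required modules are importable.")
```

Running the check on the head node only tells you about the head node. Since the error in the question is raised on the worker nodes, the same function would need to be called from inside a Spark task to see what those nodes have installed.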

Upvotes: 3
