Mikel Laburu

Reputation: 157

Is it possible to build a PySpark project with Maven?

I don't have much experience with Maven and Spark, but everything I have done so far was in Scala. Now I have to develop a project in PySpark, and I was wondering whether it is possible to create a PySpark project using Maven, and if so, how I would have to build the pom file.

So far, for example, I have specified these properties in the pom:

    <properties>
        <maven.compiler.source>1.8</maven.compiler.source>
        <maven.compiler.target>1.8</maven.compiler.target>
        <maven.assembly.plugin.version>3.1.0</maven.assembly.plugin.version>
        <maven.antrun.plugin.version>1.8</maven.antrun.plugin.version>
        <maven.surefire.plugin.version>3.0.0-M5</maven.surefire.plugin.version>
        <maven.surefire.report.plugin.version>2.18.1</maven.surefire.report.plugin.version>
        <maven.shade.plugin.version>3.1.1</maven.shade.plugin.version>
        <maven.site.plugin.version>3.6</maven.site.plugin.version>
        <maven.project.info.reports.plugin.version>2.2</maven.project.info.reports.plugin.version>
        <scala.maven.plugin.version>4.1.1</scala.maven.plugin.version>
        <maven.scalastyle.plugin.version>1.0.0</maven.scalastyle.plugin.version>
        <encoding>UTF-8</encoding>
        <scala.version>2.11.12</scala.version>
        <spark.version>2.4.0.cloudera2</spark.version>
        <hive-service.version>3.1.2</hive-service.version>
        <spark.databricks.version>1.5.0</spark.databricks.version>
        ...
    </properties>

Would it work the same way, only replacing <scala.version>2.11.12</scala.version> with <python.version>3.6</python.version>? Or something like that?

Upvotes: 1

Views: 1306

Answers (2)

ankursingh1000

Reputation: 1419

The languages supported by Spark are:

  1. Java
  2. Scala
  3. Python
  4. R

The spark-submit command has this general form:

    ./bin/spark-submit \
      --class <main-class> \
      --master <master-url> \
      --deploy-mode <deploy-mode> \
      --conf <key>=<value> \
      ... # other options
      <application-jar> \
      [application-arguments]

You can explore the support for each language at https://spark.apache.org/.

Each of these languages has its own build and deploy strategy.

E.g., for Java/Scala you can use Gradle or Maven for building, which produces a jar file that you can run on any machine that has Java and Spark set up:

    ./bin/spark-submit \
      --class org.apache.spark.examples.SparkPi \
      --master local[8] \
      /path/to/examples.jar \
      100

For Python, you can use PyBuilder to build a zip file, or build an egg, or create a wheel distribution file, which can then be used in the spark-submit command.
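Since the thread does not show any actual PySpark code, here is a minimal sketch of the kind of .py entry point you would submit; the file name app.py and the app name are illustrative, not from the original answer:

    # app.py - minimal PySpark entry point (illustrative names)
    from pyspark.sql import SparkSession

    if __name__ == "__main__":
        # Create (or reuse) a SparkSession; spark-submit supplies the master URL
        spark = SparkSession.builder.appName("example-app").getOrCreate()
        df = spark.range(100)   # tiny demo DataFrame with 100 rows
        print(df.count())       # prints 100
        spark.stop()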

Simply pass a .py file in the place of <application-jar>, and add Python .zip, .egg or .py files to the search path with --py-files.

    --py-files PY_FILES     Comma-separated list of .zip, .egg, or .py files to place on the PYTHONPATH for Python apps.
    --class CLASS_NAME      Your application's main class (for Java / Scala apps).
    --name NAME             A name of your application.
    --jars JARS             Comma-separated list of jars to include on the driver and executor classpaths.
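Putting this together, a PySpark submission might look like the following sketch; app.py and deps.zip are hypothetical names, and --class is omitted because it only applies to Java/Scala apps:

    ./bin/spark-submit \
      --master local[8] \
      --py-files deps.zip \
      /path/to/app.py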

Upvotes: 1

Sachit Murarka

Reputation: 183

To work on a PySpark project you need a setup.py. You may refer to the documentation on packaging Python applications. In setup.py you list the dependencies, and to create an artifact you can build a wheel file. The wheel file can then be part of the spark-submit command.
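A minimal sketch of such a setup.py; the package name myjob and the pyspark version pin are assumptions, not part of the original answer:

    # setup.py - minimal packaging sketch (name and versions are assumptions)
    from setuptools import setup, find_packages

    setup(
        name="myjob",
        version="0.1.0",
        packages=find_packages(),
        install_requires=["pyspark>=2.4"],  # runtime dependencies listed here
    )

Running python setup.py bdist_wheel (with the wheel package installed) then produces a .whl file under dist/, which can be shipped to the cluster with the --py-files option of spark-submit.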

Upvotes: 0
