justjack2000

Reputation: 1

Is there any benefit to reducing size of Uber JAR for Spark?

I noticed in our company we're building an Uber Jar for our Spark jobs, which has a size of roughly 1.5Gb. Looking into the Jar, there are lots of resources we don't need (mostly coming from ML libraries). Is there any benefit in investing time to reduce the size of this Jar?

Upvotes: -1

Views: 23

Answers (1)

Chris

Reputation: 2841

There can be a significant benefit, depending on how you distribute the jar and what your runtime is, neither of which is mentioned in the question. The larger a file is, the longer it takes to distribute and load; if this happens on every start of your cluster, you'll see a slowdown (again depending on how the jar is loaded, and on how long the load time is relative to the data processing that follows).

It's quite possible your jobs spend so much time processing that you don't even notice the load time.

That said, if you are packaging with Maven you could try using minimizeJar:

<plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-shade-plugin</artifactId>
    <version>${mavenShadePluginVersion}</version>
    <executions>
        <execution>
            <phase>package</phase>
            <goals>
                <goal>shade</goal>
            </goals>
            <configuration>
                <minimizeJar>true</minimizeJar>
                <artifactSet>
                    <!-- list the artifacts to include/exclude here -->
                </artifactSet>
            </configuration>
        </execution>
    </executions>
</plugin>
If it works for you, it's pretty straightforward. You should definitely spend time trimming your artifactSet anyway, as it reduces possible classpath conflicts with your runtime (also look at relocating packages to further isolate your jar's classes). For some of my projects it doesn't work, though, and cuts too many class files.
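As a rough sketch of what trimming and relocating could look like (the excluded artifact, the relocated package, and the company namespace below are illustrative assumptions, not taken from the question):

    <configuration>
        <artifactSet>
            <excludes>
                <!-- example: Spark itself is usually provided by the cluster,
                     so there's no need to shade it in -->
                <exclude>org.apache.spark:*</exclude>
            </excludes>
        </artifactSet>
        <relocations>
            <relocation>
                <!-- example: move Guava under your own namespace so it can't
                     clash with the runtime's copy -->
                <pattern>com.google.common</pattern>
                <shadedPattern>com.mycompany.shaded.com.google.common</shadedPattern>
            </relocation>
        </relocations>
    </configuration>

And if minimizeJar cuts classes that are only reached reflectively, you can pin a dependency's classes back in with a filter (again, the artifact name is a placeholder):

    <filters>
        <filter>
            <artifact>com.mycompany:some-dependency</artifact>
            <includes>
                <include>**</include>
            </includes>
        </filter>
    </filters>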

Upvotes: 0
