Reputation: 75
I am new to Apache Beam and am trying to run a sample read-and-write program with both DirectRunner and DataflowRunner. My use case needs a few CLI arguments, so I created an interface, "CustomOptions.java", that extends PipelineOptions.
With DirectRunner the program runs fine, but with DataflowRunner it fails with "Class interface CustomOptions missing a property named 'project'".
pom.xml
<dependencies>
    <dependency>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-assembly-plugin</artifactId>
        <version>3.2.0</version>
        <type>maven-plugin</type>
    </dependency>
    <dependency>
        <groupId>org.apache.beam</groupId>
        <artifactId>beam-sdks-java-core</artifactId>
        <version>2.16.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.beam</groupId>
        <artifactId>beam-runners-google-cloud-dataflow-java</artifactId>
        <version>2.16.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.beam</groupId>
        <artifactId>beam-runners-direct-java</artifactId>
        <version>2.16.0</version>
    </dependency>
</dependencies>
CustomOptions.java (Interface)
import org.apache.beam.sdk.options.PipelineOptions;

public interface CustomOptions extends PipelineOptions {
    String getInput();
    void setInput(String value);

    String getOutput();
    void setOutput(String value);
}
WordCount.java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class WordCount {
    public static void main(String[] args) {
        PipelineOptionsFactory.register(CustomOptions.class);
        CustomOptions options = PipelineOptionsFactory.fromArgs(args).as(CustomOptions.class);
        Pipeline p = Pipeline.create(options);
        p.apply("Read", TextIO.read().from(options.getInput()))
         .apply("Write", TextIO.write().to(options.getOutput()));
        p.run();
    }
}
Commands:
DirectRunner (Working) : java -cp jarPath WordCount --input=inputPath --output=outputPath
DataflowRunner (Not Working) : java -cp jarPath WordCount --input=inputPath --output=outputPath --runner=DataflowRunner --stagingLocation=gs://<tmp_path> --project=<projectId>
Error:
Exception in thread "main" java.lang.IllegalArgumentException: Class interface CustomOptions missing a property named 'project'.
at org.apache.beam.sdk.options.PipelineOptionsFactory.parseObjects(PipelineOptionsFactory.java:1625)
at org.apache.beam.sdk.options.PipelineOptionsFactory.access$400(PipelineOptionsFactory.java:115)
at org.apache.beam.sdk.options.PipelineOptionsFactory$Builder.as(PipelineOptionsFactory.java:298)
at WordCount.main(WordCount.java:13)
The second thing I tried was to make CustomOptions extend DataflowPipelineOptions instead of PipelineOptions. That also fails, with a different error:
Exception in thread "main" java.lang.IllegalArgumentException: No filesystem found for scheme gs
at org.apache.beam.sdk.io.FileSystems.getFileSystemInternal(FileSystems.java:463)
at org.apache.beam.sdk.io.FileSystems.matchNewResource(FileSystems.java:533)
at org.apache.beam.sdk.io.FileBasedSink.convertToFileResourceIfPossible(FileBasedSink.java:215)
at org.apache.beam.sdk.io.TextIO$TypedWrite.to(TextIO.java:734)
at org.apache.beam.sdk.io.TextIO$Write.to(TextIO.java:1069)
at WordCount.main(WordCount.java:15)
The second attempt raises a further question: the same code can no longer be run with both DirectRunner and DataflowRunner, because with DataflowPipelineOptions the project ID becomes a mandatory argument, and it would not be specified when running with DirectRunner.
Upvotes: 0
Views: 1401
Reputation: 75
After some trial and error, I think I have it right. I am using the same Java classes as in the question, i.e. CustomOptions extends PipelineOptions. The only change I made was in pom.xml.
I now use the Maven Shade plugin, with a little extra configuration, instead of the Maven Assembly plugin. With this I achieved two things: 1. The same jar can be used with either DirectRunner or DataflowRunner. 2. I can state on the command line which main class I want to execute.
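As a rough illustration of why the ServicesResourceTransformer line in the Shade configuration matters: Beam discovers option interfaces and filesystems through Java's ServiceLoader, which reads files under META-INF/services in the jar. Several dependencies ship a file with the same path, so a fat jar that keeps only one copy can lose registrations, whereas the transformer concatenates them. The sketch below is a toy model in plain Java, not the plugin's implementation. The class ServiceFileMergeSketch, its methods, and the example.* provider names are hypothetical, and which copy a naive repackaging keeps is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ServiceFileMergeSketch {

    /** A service file as shipped by one dependency jar: its path and the providers it lists. */
    public static final class ServiceFile {
        final String path;
        final List<String> providers;

        public ServiceFile(String path, String... providers) {
            this.path = path;
            this.providers = Arrays.asList(providers);
        }
    }

    /** Naive repackaging: only one copy per path survives (assumed here: the first one seen). */
    public static Map<String, List<String>> keepOneCopy(List<ServiceFile> files) {
        Map<String, List<String>> jar = new LinkedHashMap<>();
        for (ServiceFile f : files) {
            jar.putIfAbsent(f.path, new ArrayList<>(f.providers));
        }
        return jar;
    }

    /** Merging: copies that share a path are concatenated into one file. */
    public static Map<String, List<String>> concatenate(List<ServiceFile> files) {
        Map<String, List<String>> jar = new LinkedHashMap<>();
        for (ServiceFile f : files) {
            jar.computeIfAbsent(f.path, k -> new ArrayList<>()).addAll(f.providers);
        }
        return jar;
    }

    public static void main(String[] args) {
        String path = "META-INF/services/org.apache.beam.sdk.io.FileSystemRegistrar";
        // Hypothetical provider names standing in for a local and a GCS registrar.
        List<ServiceFile> files = Arrays.asList(
                new ServiceFile(path, "example.LocalRegistrar"),
                new ServiceFile(path, "example.GcsRegistrar"));

        System.out.println("one copy kept: " + keepOneCopy(files).get(path));
        System.out.println("concatenated:  " + concatenate(files).get(path));
    }
}
```

In the first case only one registrar remains visible to ServiceLoader, which would be consistent with both errors in the question: a missing Dataflow options registration ("missing a property named 'project'") and a missing GCS filesystem registration ("No filesystem found for scheme gs").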
Previous 'pom.xml':
<build>
    <plugins>
        <plugin>
            <artifactId>maven-assembly-plugin</artifactId>
            <version>3.2.0</version>
            <configuration>
                <descriptorRefs>
                    <descriptorRef>jar-with-dependencies</descriptorRef>
                </descriptorRefs>
            </configuration>
            <executions>
                <execution>
                    <id>make-assembly</id> <!-- this is used for inheritance merges -->
                    <phase>package</phase> <!-- bind to the packaging phase -->
                    <goals>
                        <goal>single</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-shade-plugin</artifactId>
            <version>3.2.0</version>
            <executions>
                <execution>
                    <phase>package</phase>
                    <goals>
                        <goal>shade</goal>
                    </goals>
                    <configuration>
                        <transformers>
                            <transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/>
                            <!-- add Main-Class to manifest file -->
                            <transformer
                                implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
                                <mainClass>com.dh.WordCount</mainClass>
                            </transformer>
                        </transformers>
                    </configuration>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>
<dependencies>
    <dependency>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-assembly-plugin</artifactId>
        <version>3.2.0</version>
        <type>maven-plugin</type>
    </dependency>
    <dependency>
        <groupId>org.apache.beam</groupId>
        <artifactId>beam-sdks-java-core</artifactId>
        <version>2.16.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.beam</groupId>
        <artifactId>beam-runners-google-cloud-dataflow-java</artifactId>
        <version>2.16.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.beam</groupId>
        <artifactId>beam-runners-direct-java</artifactId>
        <version>2.16.0</version>
    </dependency>
</dependencies>
New 'pom.xml':
<build>
    <plugins>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-shade-plugin</artifactId>
            <version>3.2.0</version>
            <executions>
                <execution>
                    <phase>package</phase>
                    <goals>
                        <goal>shade</goal>
                    </goals>
                    <configuration>
                        <transformers>
                            <transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/>
                            <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer"/>
                        </transformers>
                    </configuration>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>
<dependencies>
    <dependency>
        <groupId>org.apache.beam</groupId>
        <artifactId>beam-sdks-java-core</artifactId>
        <version>2.16.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.beam</groupId>
        <artifactId>beam-runners-google-cloud-dataflow-java</artifactId>
        <version>2.16.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.beam</groupId>
        <artifactId>beam-runners-direct-java</artifactId>
        <version>2.16.0</version>
    </dependency>
</dependencies>
I worked this out after reading this answer: Google Dataflow "No filesystem found for scheme gs"
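One way to check whether a built jar carries the service registrations Beam needs is to list its META-INF/services entries. The utility below is a minimal sketch using only the JDK; the class name ListServiceEntries and its readServices method are made up for illustration, and the jar path is whatever your build produces. For a jar meant to run on Dataflow, I would expect the entries for org.apache.beam.sdk.options.PipelineOptionsRegistrar and org.apache.beam.sdk.io.FileSystemRegistrar to list Dataflow and GCS providers respectively, though the exact provider class names depend on the Beam version.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.jar.JarEntry;
import java.util.jar.JarFile;

public class ListServiceEntries {

    private static final String SERVICES_DIR = "META-INF/services/";

    /** Returns a map from service interface name to the provider class names the jar lists for it. */
    public static Map<String, List<String>> readServices(String jarPath) throws IOException {
        Map<String, List<String>> services = new TreeMap<>();
        try (JarFile jar = new JarFile(jarPath)) {
            Enumeration<JarEntry> entries = jar.entries();
            while (entries.hasMoreElements()) {
                JarEntry entry = entries.nextElement();
                String name = entry.getName();
                if (entry.isDirectory() || !name.startsWith(SERVICES_DIR)) {
                    continue;
                }
                String service = name.substring(SERVICES_DIR.length());
                List<String> providers = services.computeIfAbsent(service, k -> new ArrayList<>());
                try (BufferedReader reader = new BufferedReader(
                        new InputStreamReader(jar.getInputStream(entry), StandardCharsets.UTF_8))) {
                    String line;
                    while ((line = reader.readLine()) != null) {
                        // Service files allow '#' comments; keep only the class name.
                        int hash = line.indexOf('#');
                        if (hash >= 0) {
                            line = line.substring(0, hash);
                        }
                        line = line.trim();
                        if (!line.isEmpty()) {
                            providers.add(line);
                        }
                    }
                }
            }
        }
        return services;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java ListServiceEntries <path-to-jar>");
            return;
        }
        readServices(args[0]).forEach((service, providers) -> {
            System.out.println(service);
            providers.forEach(p -> System.out.println("  " + p));
        });
    }
}
```

Running it against the jar from each pom.xml and comparing the output should show whether any provider lists were lost during packaging.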
Upvotes: 3