Robert Margeson
Robert Margeson

Reputation: 23

Google Cloud Dataflow unable to find parent directory in Google Storage

When running a Dataflow Pipeline in Streaming mode, java.io.IOException is Unable to find parent directory in Google Storage.

I don't believe this is a permission issue because I can view the file using gsutil ls and this works when the Pipeline is executed in Batch mode.

gsutil ls gs://my-bucket/*.txt
gs://my-bucket/product-names.txt

Any idea what would cause this?

    [WARNING] 
java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.codehaus.mojo.exec.ExecJavaMojo$1.run(ExecJavaMojo.java:293)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.IllegalStateException: Failed to validate gs://my-bucket/*.txt
--streaming
    at com.google.cloud.dataflow.sdk.io.TextIO$Read$Bound.apply(TextIO.java:328)
    at com.google.cloud.dataflow.sdk.io.TextIO$Read$Bound.apply(TextIO.java:213)
    at com.google.cloud.dataflow.sdk.runners.PipelineRunner.apply(PipelineRunner.java:75)
    at com.google.cloud.dataflow.sdk.runners.DataflowPipelineRunner.apply(DataflowPipelineRunner.java:474)
    at com.google.cloud.dataflow.sdk.Pipeline.applyInternal(Pipeline.java:368)
    at com.google.cloud.dataflow.sdk.Pipeline.applyTransform(Pipeline.java:275)
    at com.google.cloud.dataflow.sdk.values.PBegin.apply(PBegin.java:47)
    at com.google.cloud.dataflow.sdk.Pipeline.apply(Pipeline.java:157)
    at com.google.cloud.dataflow.examples.complete.AutoComplete.main(AutoComplete.java:287)
    ... 6 more
Caused by: java.io.IOException: Unable to find parent directory of gs://my-bucket/*.txt
--streaming
    at com.google.cloud.dataflow.sdk.util.FileIOChannelFactory.match(FileIOChannelFactory.java:59)
    at com.google.cloud.dataflow.sdk.io.TextIO$Read$Bound.apply(TextIO.java:323)
    ... 14 more
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 3.784 s
[INFO] Finished at: 2017-10-11T10:19:05-04:00
[INFO] Final Memory: 30M/541M
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.codehaus.mojo:exec-maven-plugin:1.4.0:java (default-cli) on project google-cloud-dataflow-java-examples-all: An exception occured while executing the Java class. null: InvocationTargetException: Failed to validate gs://my-bucket/*.txt
[ERROR] --streaming: Unable to find parent directory of gs://my-bucket/*.txt
[ERROR] --streaming
[ERROR] -> [Help 1]
org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute goal org.codehaus.mojo:exec-maven-plugin:1.4.0:java (default-cli) on project google-cloud-dataflow-java-examples-all: An exception occured while executing the Java class. null
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:212)
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:153)
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:145)
    at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:116)
    at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:80)
    at org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder.build(SingleThreadedBuilder.java:51)
    at org.apache.maven.lifecycle.internal.LifecycleStarter.execute(LifecycleStarter.java:128)
    at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:307)
    at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:193)
    at org.apache.maven.DefaultMaven.execute(DefaultMaven.java:106)
    at org.apache.maven.cli.MavenCli.execute(MavenCli.java:863)
    at org.apache.maven.cli.MavenCli.doMain(MavenCli.java:288)
    at org.apache.maven.cli.MavenCli.main(MavenCli.java:199)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced(Launcher.java:289)
    at org.codehaus.plexus.classworlds.launcher.Launcher.launch(Launcher.java:229)
    at org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode(Launcher.java:415)
    at org.codehaus.plexus.classworlds.launcher.Launcher.main(Launcher.java:356)
Caused by: org.apache.maven.plugin.MojoExecutionException: An exception occured while executing the Java class. null
    at org.codehaus.mojo.exec.ExecJavaMojo.execute(ExecJavaMojo.java:345)
    at org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:134)
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:207)
    ... 20 more
Caused by: java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.codehaus.mojo.exec.ExecJavaMojo$1.run(ExecJavaMojo.java:293)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.IllegalStateException: Failed to validate gs://my-bucket/*.txt
--streaming
    at com.google.cloud.dataflow.sdk.io.TextIO$Read$Bound.apply(TextIO.java:328)
    at com.google.cloud.dataflow.sdk.io.TextIO$Read$Bound.apply(TextIO.java:213)
    at com.google.cloud.dataflow.sdk.runners.PipelineRunner.apply(PipelineRunner.java:75)
    at com.google.cloud.dataflow.sdk.runners.DataflowPipelineRunner.apply(DataflowPipelineRunner.java:474)
    at com.google.cloud.dataflow.sdk.Pipeline.applyInternal(Pipeline.java:368)
    at com.google.cloud.dataflow.sdk.Pipeline.applyTransform(Pipeline.java:275)
    at com.google.cloud.dataflow.sdk.values.PBegin.apply(PBegin.java:47)
    at com.google.cloud.dataflow.sdk.Pipeline.apply(Pipeline.java:157)
    at com.google.cloud.dataflow.examples.complete.AutoComplete.main(AutoComplete.java:287)
    ... 6 more
Caused by: java.io.IOException: Unable to find parent directory of gs://my-bucket/*.txt
--streaming
    at com.google.cloud.dataflow.sdk.util.FileIOChannelFactory.match(FileIOChannelFactory.java:59)
    at com.google.cloud.dataflow.sdk.io.TextIO$Read$Bound.apply(TextIO.java:323)
    ... 14 more
[ERROR] 
[ERROR] 
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException

Upvotes: 0

Views: 363

Answers (1)

Ben Chambers
Ben Chambers

Reputation: 6130

It looks like the implementation of TextIO requires that the glob (*.txt) be located within a directory, not a bucket. So gs://my-bucket/object/*.txt should work. Specifically, it requires that the glob has a parent object (the "parent" here would be gs://my-bucket/object).

Upvotes: 1

Related Questions