Apache Beam Spark/Flink runner not getting executed in EMR (access files from GCS)

Question

I have an Apache Beam pipeline that indexes some data into Elasticsearch. I am trying to run the job on AWS EMR with the Spark or Flink runner. On a standalone Spark setup on my local machine, the pipeline works when the source files are on the local disk, but it does not work when I read the files from GCS. The same thing happens when I run it on the EMR cluster.

These are the settings I added to the Hadoop core-site.xml through the EMR configuration:

{
  "Classification": "core-site",
  "Properties": {
    "fs.gs.impl": "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem",
    "fs.AbstractFileSystem.gs.impl": "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS",
    "fs.gs.project.id": "data-warehouse",
    "google.cloud.auth.service.account.enable": "true",
    "fs.gs.auth.service.account.json.keyfile": "/home/hadoop/utils/key.json"
  }
}
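
Since the config above points at a key file on the local filesystem, one thing that may be worth ruling out is whether that file is actually present and readable on each node that runs tasks, not only on the master. The following is a minimal sketch using only the Java standard library; the class name `KeyfileCheck` and the method `describe` are hypothetical names for illustration and are not part of Beam, Hadoop, or the GCS connector. It only checks the file itself and says nothing about whether the credentials inside are valid.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper: reports whether a service-account key file can be read locally.
public class KeyfileCheck {

    // Returns a short status string for the given path.
    static String describe(Path keyfile) {
        if (!Files.exists(keyfile)) {
            return "missing";
        }
        if (!Files.isRegularFile(keyfile)) {
            return "not a file";
        }
        if (!Files.isReadable(keyfile)) {
            return "unreadable";
        }
        return "ok";
    }

    public static void main(String[] args) {
        // Default path taken from fs.gs.auth.service.account.json.keyfile in the config above.
        String location = args.length > 0 ? args[0] : "/home/hadoop/utils/key.json";
        Path keyfile = Paths.get(location);
        System.out.println(keyfile + ": " + describe(keyfile));
    }
}
```

Running this on a core or task node as the same user that runs the job would show whether the path configured in core-site resolves there.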

The GCS connector jar is also on both the Spark jar path and the Hadoop jar path.

The Maven pom file for the pipeline:


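<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>

  <groupId>com.company.beam</groupId>
  <artifactId>IndexToEs</artifactId>
  <version>1.0-SNAPSHOT</version>

  <properties>
    <beam.version>2.22.0</beam.version>
    <maven-compiler-plugin.version>3.7.0</maven-compiler-plugin.version>
    <maven-exec-plugin.version>1.6.0</maven-exec-plugin.version>
    <slf4j.version>1.7.25</slf4j.version>
  </properties>

  <repositories>
    <repository>
      <id>apache.snapshots</id>
      <name>Apache Development Snapshot Repository</name>
      <url>https://repository.apache.org/content/repositories/snapshots/</url>
      <releases>
        <enabled>false</enabled>
      </releases>
      <snapshots>
        <enabled>true</enabled>
      </snapshots>
    </repository>
  </repositories>

  <build>
    <plugins>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-compiler-plugin</artifactId>
        <version>${maven-compiler-plugin.version}</version>
        <configuration>
          <source>8</source>
          <target>8</target>
        </configuration>
      </plugin>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-shade-plugin</artifactId>
        <version>3.2.3</version>
        <executions>
          <execution>
            <goals>
              <goal>shade</goal>
            </goals>
            <configuration>
              <createDependencyReducedPom>false</createDependencyReducedPom>
              <transformers>
                <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
                  <mainClass>com.company.beam.IndexToEs</mainClass>
                </transformer>
              </transformers>
              <filters>
                <filter>
                  <artifact>*:*</artifact>
                  <excludes>
                    <exclude>META-INF/*.SF</exclude>
                    <exclude>META-INF/*.DSA</exclude>
                    <exclude>META-INF/*.RSA</exclude>
                  </excludes>
                </filter>
              </filters>
            </configuration>
          </execution>
        </executions>
      </plugin>
    </plugins>

    <pluginManagement>
      <plugins>
        <plugin>
          <groupId>org.codehaus.mojo</groupId>
          <artifactId>exec-maven-plugin</artifactId>
          <version>${maven-exec-plugin.version}</version>
          <configuration>
            <cleanupDaemonThreads>false</cleanupDaemonThreads>
          </configuration>
        </plugin>
      </plugins>
    </pluginManagement>
  </build>

  <dependencies>
    <dependency>
      <groupId>org.apache.beam</groupId>
      <artifactId>beam-sdks-java-core</artifactId>
      <version>${beam.version}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.beam</groupId>
      <artifactId>beam-sdks-java-io-elasticsearch</artifactId>
      <version>${beam.version}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.beam</groupId>
      <artifactId>beam-sdks-java-io-google-cloud-platform</artifactId>
      <version>${beam.version}</version>
    </dependency>

    <dependency>
      <groupId>org.apache.beam</groupId>
      <artifactId>beam-sdks-java-extensions-google-cloud-platform-core</artifactId>
      <version>${beam.version}</version>
    </dependency>

    <dependency>
      <groupId>org.apache.beam</groupId>
      <artifactId>beam-runners-google-cloud-dataflow-java</artifactId>
      <version>${beam.version}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.beam</groupId>
      <artifactId>beam-runners-direct-java</artifactId>
      <version>${beam.version}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.beam</groupId>
      <artifactId>beam-runners-spark</artifactId>
      <version>${beam.version}</version>
    </dependency>

    <dependency>
      <groupId>org.apache.beam</groupId>
      <artifactId>beam-runners-flink_2.11</artifactId>
      <version>2.16.0</version>
    </dependency>
    <dependency>
      <groupId>com.google.cloud.bigdataoss</groupId>
      <artifactId>gcs-connector</artifactId>
      <version>hadoop2-1.9.17</version>
    </dependency>

    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.11</artifactId>
      <version>2.1.3</version>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-streaming_2.11</artifactId>
      <version>2.1.3</version>
    </dependency>
  </dependencies>
</project>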

There is no error: EMR shows the task as completed, but the pipeline has not actually run.

I could not figure out whether this is an Apache Beam problem or a cluster configuration problem.

