Reputation: 2352
I ran the following commands in Pig on the Google n-grams dataset:
inp = LOAD 'link to file' AS (ngram:chararray, year:int, occurences:float, books:float);
filter_input = FILTER inp BY (occurences >= 400) AND (books >= 8);
groupinp = GROUP filter_input BY ngram;
sum_occ = FOREACH groupinp GENERATE FLATTEN(group) as ngram, SUM(filter_input.occurences) / SUM(filter_input.books) AS ntry;
roundto = FOREACH sum_occ GENERATE sum_occ.ngram, ROUND_TO( sum_occ.ntry , 2 );
However, I get the following error:
DUMP roundto;
601062 [main] WARN org.apache.pig.newplan.BaseOperatorPlan - Encountered Warning IMPLICIT_CAST_TO_FLOAT 2 time(s).
18/04/06 01:46:03 WARN newplan.BaseOperatorPlan: Encountered Warning IMPLICIT_CAST_TO_FLOAT 2 time(s).
601067 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: GROUP_BY,FILTER
18/04/06 01:46:03 INFO pigstats.ScriptState: Pig features used in the script: GROUP_BY,FILTER
601111 [main] INFO org.apache.pig.data.SchemaTupleBackend - Key [pig.schematuple] was not set... will not generate code.
18/04/06 01:46:03 INFO data.SchemaTupleBackend: Key [pig.schematuple] was not set... will not generate code.
601111 [main] INFO org.apache.pig.newplan.logical.optimizer.LogicalPlanOptimizer - {RULES_ENABLED=[AddForEach, ColumnMapKeyPrune, ConstantCalculator, GroupByConstParallelSetter, LimitOptimizer, LoadTypeCastInserter, MergeFilter, MergeForEach, NestedLimitOptimizer, PartitionFilterOptimizer, PredicatePushdownOptimizer, PushDownForEachFlatten, PushUpFilter, SplitFilter, StreamTypeCastInserter]}
18/04/06 01:46:03 INFO optimizer.LogicalPlanOptimizer: {RULES_ENABLED=[AddForEach, ColumnMapKeyPrune, ConstantCalculator, GroupByConstParallelSetter, LimitOptimizer, LoadTypeCastInserter, MergeFilter, MergeForEach, NestedLimitOptimizer, PartitionFilterOptimizer, PredicatePushdownOptimizer, PushDownForEachFlatten, PushUpFilter, SplitFilter, StreamTypeCastInserter]}
601238 [main] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezLauncher - Tez staging directory is /tmp/temp-336429202 and resources directory is /tmp/temp-336429202
18/04/06 01:46:03 INFO tez.TezLauncher: Tez staging directory is /tmp/temp-336429202 and resources directory is /tmp/temp-336429202
601239 [main] INFO org.apache.pig.backend.hadoop.executionengine.tez.plan.TezCompiler - File concatenation threshold: 100 optimistic? false
18/04/06 01:46:03 INFO plan.TezCompiler: File concatenation threshold: 100 optimistic? false
601241 [main] INFO org.apache.pig.backend.hadoop.executionengine.util.CombinerOptimizerUtil - Choosing to move algebraic foreach to combiner
18/04/06 01:46:03 INFO util.CombinerOptimizerUtil: Choosing to move algebraic foreach to combiner
601265 [main] INFO org.apache.pig.builtin.PigStorage - Using PigTextInputFormat
18/04/06 01:46:03 INFO builtin.PigStorage: Using PigTextInputFormat
18/04/06 01:46:03 INFO input.FileInputFormat: Total input files to process : 1
601285 [main] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
18/04/06 01:46:03 INFO util.MapRedUtil: Total input paths to process : 1
601285 [main] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 1
18/04/06 01:46:03 INFO util.MapRedUtil: Total input paths (combined) to process : 1
18/04/06 01:46:03 INFO hadoop.MRInputHelpers: NumSplits: 1, SerializedSize: 408
601322 [main] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJobCompiler - Local resource: joda-time-2.9.4.jar
18/04/06 01:46:03 INFO tez.TezJobCompiler: Local resource: joda-time-2.9.4.jar
601322 [main] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJobCompiler - Local resource: pig-0.17.0-core-h2.jar
18/04/06 01:46:03 INFO tez.TezJobCompiler: Local resource: pig-0.17.0-core-h2.jar
601322 [main] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJobCompiler - Local resource: antlr-runtime-3.4.jar
18/04/06 01:46:03 INFO tez.TezJobCompiler: Local resource: antlr-runtime-3.4.jar
601322 [main] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJobCompiler - Local resource: automaton-1.11-8.jar
18/04/06 01:46:03 INFO tez.TezJobCompiler: Local resource: automaton-1.11-8.jar
601402 [main] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezDagBuilder - For vertex - scope-141: parallelism=1, memory=1536, java opts=-Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -Xmx1229m -Dlog4j.configuratorClass=org.apache.tez.common.TezLog4jConfigurator -Dlog4j.configuration=tez-container-log4j.properties -Dyarn.app.container.log.dir=<LOG_DIR> -Dtez.root.logger=INFO,CLA
18/04/06 01:46:03 INFO tez.TezDagBuilder: For vertex - scope-141: parallelism=1, memory=1536, java opts=-Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -Xmx1229m -Dlog4j.configuratorClass=org.apache.tez.common.TezLog4jConfigurator -Dlog4j.configuration=tez-container-log4j.properties -Dyarn.app.container.log.dir=<LOG_DIR> -Dtez.root.logger=INFO,CLA
601402 [main] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezDagBuilder - Processing aliases: filter_input,groupinp,inp,sum_occ
18/04/06 01:46:03 INFO tez.TezDagBuilder: Processing aliases: filter_input,groupinp,inp,sum_occ
601402 [main] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezDagBuilder - Detailed locations: inp[1,6],inp[-1,-1],filter_input[2,15],sum_occ[4,10],groupinp[3,11]
18/04/06 01:46:03 INFO tez.TezDagBuilder: Detailed locations: inp[1,6],inp[-1,-1],filter_input[2,15],sum_occ[4,10],groupinp[3,11]
601402 [main] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezDagBuilder - Pig features in the vertex:
18/04/06 01:46:03 INFO tez.TezDagBuilder: Pig features in the vertex:
601449 [main] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezDagBuilder - Set auto parallelism for vertex scope-142
18/04/06 01:46:03 INFO tez.TezDagBuilder: Set auto parallelism for vertex scope-142
601450 [main] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezDagBuilder - For vertex - scope-142: parallelism=1, memory=3072, java opts=-Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -Xmx2458m -Dlog4j.configuratorClass=org.apache.tez.common.TezLog4jConfigurator -Dlog4j.configuration=tez-container-log4j.properties -Dyarn.app.container.log.dir=<LOG_DIR> -Dtez.root.logger=INFO,CLA
18/04/06 01:46:03 INFO tez.TezDagBuilder: For vertex - scope-142: parallelism=1, memory=3072, java opts=-Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -Xmx2458m -Dlog4j.configuratorClass=org.apache.tez.common.TezLog4jConfigurator -Dlog4j.configuration=tez-container-log4j.properties -Dyarn.app.container.log.dir=<LOG_DIR> -Dtez.root.logger=INFO,CLA
601450 [main] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezDagBuilder - Processing aliases: roundto,sum_occ
18/04/06 01:46:03 INFO tez.TezDagBuilder: Processing aliases: roundto,sum_occ
601450 [main] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezDagBuilder - Detailed locations: sum_occ[4,10],roundto[6,10]
18/04/06 01:46:03 INFO tez.TezDagBuilder: Detailed locations: sum_occ[4,10],roundto[6,10]
601450 [main] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezDagBuilder - Pig features in the vertex: GROUP_BY
18/04/06 01:46:03 INFO tez.TezDagBuilder: Pig features in the vertex: GROUP_BY
601489 [main] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJobCompiler - Total estimated parallelism is 2
18/04/06 01:46:04 INFO tez.TezJobCompiler: Total estimated parallelism is 2
601531 [PigTezLauncher-0] INFO org.apache.pig.tools.pigstats.tez.TezScriptState - Pig script settings are added to the job
18/04/06 01:46:04 INFO tez.TezScriptState: Pig script settings are added to the job
18/04/06 01:46:04 INFO client.TezClient: Tez Client Version: [ component=tez-api, version=0.8.4, revision=300391394352b074b85b529e870816a72c6f314a, SCM-URL=scm:git:https://git-wip-us.apache.org/repos/asf/tez.git, buildTime=2018-03-21T23:55:28Z ]
18/04/06 01:46:04 INFO client.RMProxy: Connecting to ResourceManager at ip-172-31-28-12.ec2.internal/172.31.28.12:8032
18/04/06 01:46:04 INFO client.TezClient: Using org.apache.tez.dag.history.ats.acls.ATSHistoryACLPolicyManager to manage Timeline ACLs
18/04/06 01:46:04 INFO impl.TimelineClientImpl: Timeline service address: http://ip-172-31-28-12.ec2.internal:8188/ws/v1/timeline/
18/04/06 01:46:04 INFO client.TezClient: Session mode. Starting session.
18/04/06 01:46:04 INFO client.TezClientUtils: Using tez.lib.uris value from configuration: hdfs:///apps/tez/tez.tar.gz
18/04/06 01:46:04 INFO client.TezClientUtils: Using tez.lib.uris.classpath value from configuration: null
18/04/06 01:46:04 INFO client.TezClient: Tez system stage directory hdfs://ip-172-31-28-12.ec2.internal:8020/tmp/temp-336429202/.tez/application_1522978297921_0003 doesn't exist and is created
18/04/06 01:46:04 INFO acls.ATSHistoryACLPolicyManager: Created Timeline Domain for History ACLs, domainId=Tez_ATS_application_1522978297921_0003
18/04/06 01:46:04 INFO impl.YarnClientImpl: Submitted application application_1522978297921_0003
18/04/06 01:46:04 INFO client.TezClient: The url to track the Tez Session: http://ip-172-31-28-12.ec2.internal:20888/proxy/application_1522978297921_0003/
607861 [PigTezLauncher-0] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJob - Submitting DAG PigLatin:DefaultJobName-0_scope-2
18/04/06 01:46:10 INFO tez.TezJob: Submitting DAG PigLatin:DefaultJobName-0_scope-2
18/04/06 01:46:10 INFO client.TezClient: Submitting dag to TezSession, sessionName=PigLatin:DefaultJobName, applicationId=application_1522978297921_0003, dagName=PigLatin:DefaultJobName-0_scope-2, callerContext={ context=PIG, callerType=PIG_SCRIPT_ID, callerId=PIG-default-d73e19dc-5287-4ee2-a85d-e931327011dc }
18/04/06 01:46:10 INFO client.TezClient: Submitted dag to TezSession, sessionName=PigLatin:DefaultJobName, applicationId=application_1522978297921_0003, dagName=PigLatin:DefaultJobName-0_scope-2
18/04/06 01:46:10 INFO client.RMProxy: Connecting to ResourceManager at ip-172-31-28-12.ec2.internal/172.31.28.12:8032
608409 [PigTezLauncher-0] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJob - Submitted DAG PigLatin:DefaultJobName-0_scope-2. Application id: application_1522978297921_0003
18/04/06 01:46:10 INFO tez.TezJob: Submitted DAG PigLatin:DefaultJobName-0_scope-2. Application id: application_1522978297921_0003
608528 [main] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezLauncher - HadoopJobId: job_1522978297921_0003
18/04/06 01:46:11 INFO tez.TezLauncher: HadoopJobId: job_1522978297921_0003
609410 [Timer-1] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: status=RUNNING, progress=TotalTasks: 2 Succeeded: 0 Running: 0 Failed: 0 Killed: 0, diagnostics=, counters=null
18/04/06 01:46:11 INFO tez.TezJob: DAG Status: status=RUNNING, progress=TotalTasks: 2 Succeeded: 0 Running: 0 Failed: 0 Killed: 0, diagnostics=, counters=null
629410 [Timer-1] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: status=RUNNING, progress=TotalTasks: 2 Succeeded: 0 Running: 1 Failed: 0 Killed: 0, diagnostics=, counters=null
18/04/06 01:46:31 INFO tez.TezJob: DAG Status: status=RUNNING, progress=TotalTasks: 2 Succeeded: 0 Running: 1 Failed: 0 Killed: 0, diagnostics=, counters=null
646404 [pool-1-thread-1] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezSessionManager - Shutting down Tez session org.apache.tez.client.TezClient@3a371843
18/04/06 01:46:48 INFO tez.TezSessionManager: Shutting down Tez session org.apache.tez.client.TezClient@3a371843
2018-04-06 01:46:48 Shutting down Tez session , sessionName=PigLatin:DefaultJobName, applicationId=application_1522978297921_0003
18/04/06 01:46:48 INFO client.TezClient: Shutting down Tez Session, sessionName=PigLatin:DefaultJobName, applicationId=application_1522978297921_0003
How do I fix this error? DUMP works for every earlier alias, just not for roundto. And what exactly is the Tez client?
Upvotes: 1
Views: 845
Reputation: 722
I can't replicate your output, because I get an error as soon as I try this line:
roundto = FOREACH sum_occ GENERATE sum_occ.ngram, ROUND_TO( sum_occ.ntry , 2 );
You don't need to use the dot operator to refer to these fields (e.g. sum_occ.ngram) because they are not nested in a tuple or bag. Try the above line without the dot operator:
roundto = FOREACH sum_occ GENERATE ngram, ROUND_TO( ntry , 2 );
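For context, here is a sketch of the whole script with that one change applied, annotated to show where the dot operator is and isn't needed. The load path is the placeholder from the question, and the ntry_rounded alias is a name I made up for illustration. I haven't run this against the n-grams data, so treat it as a starting point rather than a verified fix.

```pig
-- 'link to file' is the placeholder path from the question; substitute your own.
inp = LOAD 'link to file' AS (ngram:chararray, year:int, occurences:float, books:float);
filter_input = FILTER inp BY (occurences >= 400) AND (books >= 8);
groupinp = GROUP filter_input BY ngram;

-- The dot IS needed here: after GROUP, each record of groupinp holds a bag
-- named filter_input, and filter_input.occurences projects a column out of it.
sum_occ = FOREACH groupinp GENERATE FLATTEN(group) AS ngram,
          SUM(filter_input.occurences) / SUM(filter_input.books) AS ntry;

-- The dot is NOT needed here: ngram and ntry are top-level fields of sum_occ.
-- ntry_rounded is a hypothetical alias added only for readability.
roundto = FOREACH sum_occ GENERATE ngram, ROUND_TO(ntry, 2) AS ntry_rounded;

DUMP roundto;
```

Running DESCRIBE groupinp; and DESCRIBE sum_occ; should show the difference: the first has a nested bag, the second only flat fields.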
To answer your second question: MapReduce and Tez are both execution engines that Pig can use to run scripts. Tez can sometimes reduce the time Pig scripts take to run. You can choose one explicitly by starting your Pig shell with pig -x mapreduce or pig -x tez. MapReduce is the default, so if you haven't specified Tez, your Hadoop cluster must be set up to run Pig on Tez.
Upvotes: 1