Reputation: 1105
I am working with Hadoop-Streaming version 3.4.0 on a data processing task (word count). I have two Python scripts, mapper.py and reducer.py, and I want to make sure that all three nodes in the cluster participate in the job.
I have the following setup:
Hadoop-Streaming version: 3.4.0
Python scripts: mapper.py and reducer.py
Cluster size: 3 nodes
Is there any method to monitor the activity of each node while the job is running, and possibly to change the number of tasks running on each node? (A rough sketch of the kind of per-node view I have in mind is at the end of this post.)
mapper.py:
#!/usr/bin/env python3
import sys
from collections import Counter

def main():
    for line in sys.stdin:
        words = line.split()
        for k, v in Counter(words).items():
            message = f"{k}\t{v}"
            print(message)

if __name__ == "__main__":
    main()
reducer.py:
#!/usr/bin/env python3
import sys
from collections import defaultdict

def main():
    results = defaultdict(int)
    for line in sys.stdin:
        key, value = line.split("\t")
        results[key] += int(value)
    for key, value in results.items():
        print(f"{key}\t{value}")

if __name__ == "__main__":
    main()
I am running them on a 3-node Hadoop cluster (1x namenode, 2x datanodes):
hadoop@namenode:~$ time hadoop jar /usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-3.4.0.jar -files /tmp/mapper.py,/tmp/reducer.py -mapper mapper.py -reducer reducer.py -input /testdata*.txt -output output
real 1m23.761s
user 0m4.455s
sys 0m0.406s
hadoop@panda64:~$ time cat testdata*.txt | /tmp/mapper.py | /tmp/reducer.py >& /dev/null
real 1m39.284s
user 1m38.346s
sys 0m4.494s
hadoop@datanode1:~$ mapred job -list
Total jobs:1
JobId JobName State StartTime UserName Queue Priority UsedContainers RsvdContainers UsedMem RsvdMem NeededMem AM info
job_1712673362609_0004 streamjob8373769304891055217.jar RUNNING 1712691366231 hadoop root.default DEFAULT 7 0 8192M 0M 8192M http://panda64:8088/proxy/application_1712673362609_0004/
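The only per-node signal I can see so far is the container count above. For reference, this is roughly what I have in mind: a minimal sketch that polls the YARN ResourceManager REST API (assuming it is reachable at http://panda64:8088, the same address as the AM proxy URL above; the script name monitor_nodes.py is just for illustration):
monitor_nodes.py:
#!/usr/bin/env python3
import json
import time
import urllib.request

# ResourceManager web address, taken from the AM proxy URL shown by `mapred job -list`
# (assumption: port 8088 is reachable from where this script runs)
RM = "http://panda64:8088"

def main(interval=5):
    while True:
        # Cluster Nodes API: returns one JSON object per NodeManager
        with urllib.request.urlopen(f"{RM}/ws/v1/cluster/nodes") as resp:
            data = json.load(resp)
        for node in (data.get("nodes") or {}).get("node", []):
            print(f'{node.get("nodeHostName")}\t'
                  f'state={node.get("state")}\t'
                  f'containers={node.get("numContainers")}\t'
                  f'usedMemoryMB={node.get("usedMemoryMB")}')
        print("---")
        time.sleep(interval)

if __name__ == "__main__":
    main()
Running this while the streaming job executes, I would expect to see non-zero container counts on both datanodes if they are both participating, but I am not sure this is the intended way to monitor it, hence the question.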
Upvotes: 0
Views: 21