Jihyun
Jihyun

Reputation: 1105

Node Participation in Hadoop-Streaming with Python Scripts

I am working with Hadoop-Streaming version 3.4.0 for a data processing task (work count). I have two Python scripts: mapper.py and reducer.py. I want to ensure that all three nodes in the cluster are participating in the data processing task.

I have the following setup:

Hadoop-Streaming version: 3.4.0 Python scripts: mapper.py and reducer.py Cluster size: 3 nodes

Is there any method to monitor the activity of each node during the job's execution, and possibly to change the number of jobs running in each node?

mapper.py:

import sys
from collections import Counter
def main():
    for line in sys.stdin:
        words = line.split()
        for k,v in Counter(words).items():
            message = f"{k}\t{v}"
            print(message)
if __name__ == "__main__":
    main()

reducer.py:

import sys
from collections import defaultdict
def main():
    results = defaultdict(int)
    for line in sys.stdin:
        key,value = line.split("\t")
        results[key] += int(value)
    for key,value in results.items():
        print(f"{key}\t{value}")
if __name__ == "__main__":
    main()

I am running them on 3 node Hadoop cluster (1x namenode 2x datanodes)

hadoop@namenode:~$ time hadoop jar /usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-3.4.0.jar -files /tmp/mapper.py,/tmp/reducer.py -mapper mapper.py -reducer reducer.py -input /testdata*.txt -output output
real    1m23.761s
user    0m4.455s
sys.    0m0.406s

hadoop@panda64:~$ time cat testdata*.txt | /tmp/mapper.py | /tmp/reducer.py >& /dev/null
real    1m39.284s
user    1m38.346s
sys 0m4.494s

hadoop@datanode1:~$ mapred job -list
Total jobs:1
JobId JobName State StartTime UserName Queue Priority UsedContainers RsvdContainers UsedMem RsvdMem NeededMem AM info
job_1712673362609_0004 streamjob8373769304891055217.jar RUNNING 1712691366231 hadoop root.default DEFAULT 7 0 8192M 0M 8192M http://panda64:8088/proxy/application_1712673362609_0004/

Upvotes: 0

Views: 21

Answers (0)

Related Questions