Reputation: 6093
I am processing a file with 7+ million lines (~59 MB) on an Ubuntu 11.04 machine with this configuration:
Intel(R) Core(TM)2 Duo CPU E8135 @ 2.66GHz, 2280 MHz Memory: 2GB Disk: 100GB
Even after the job had run for 45 minutes, I didn't see any progress:
Deleted hdfs://localhost:9000/user/hadoop_admin/output
packageJobJar: [/home/hadoop_admin/Documents/NLP/Dictionary/dict/drugs.csv, /usr/local/hadoop/mapper.py, /usr/local/hadoop/reducer.py, /tmp/hadoop-hadoop_admin/hadoop-unjar8773176795802479000/] [] /tmp/streamjob582836411271840475.jar tmpDir=null
11/07/22 10:39:20 INFO mapred.FileInputFormat: Total input paths to process : 1
11/07/22 10:39:21 INFO streaming.StreamJob: getLocalDirs(): [/tmp/hadoop-hadoop_admin/mapred/local]
11/07/22 10:39:21 INFO streaming.StreamJob: Running job: job_201107181559_0099
11/07/22 10:39:21 INFO streaming.StreamJob: To kill this job, run:
11/07/22 10:39:21 INFO streaming.StreamJob: /usr/local/hadoop/bin/../bin/hadoop job -Dmapred.job.tracker=localhost:9001 -kill job_201107181559_0099
11/07/22 10:39:21 INFO streaming.StreamJob: Tracking URL: http://localhost:50030/jobdetails.jsp?jobid=job_201107181559_0099
11/07/22 10:39:22 INFO streaming.StreamJob: map 0% reduce 0%
What is the maximum possible file size that can be processed using Hadoop in pseudo-distributed mode?
Updated:
I am running a simple wordcount application using Hadoop Streaming. My mapper.py and reducer.py took around 50 sec to process a file with 220K lines (~19 MB).
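For readers unfamiliar with the setup, a streaming wordcount could look roughly like the sketch below. This is not the actual mapper.py and reducer.py from the question (those were not posted); the function names `map_lines` and `reduce_lines` and the `map`/`reduce` command-line argument are assumptions made for illustration. It relies only on the Hadoop Streaming convention that records are tab-separated key/value lines on stdin/stdout and that the reducer receives its input sorted by key.

```python
import sys
from itertools import groupby


def map_lines(lines):
    """Mapper step: emit one 'word<TAB>1' record per whitespace-separated token."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reduce_lines(lines):
    """Reducer step: sum the counts for each word.

    Assumes the input records are already sorted by word, which is how
    Hadoop Streaming hands mapper output to the reducer.
    """
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


# Hypothetical convention for this sketch: the first argument selects the step.
if __name__ == "__main__" and len(sys.argv) > 1:
    step = map_lines if sys.argv[1] == "map" else reduce_lines
    for record in step(sys.stdin):
        print(record)
```

The same logic can be checked locally without Hadoop by piping a sample of the input through the map step, `sort`, and then the reduce step, which helps separate slow scripts from a stalled job.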
Upvotes: 0
Views: 560
Reputation: 10642
The size limit is really dictated by the amount of storage you have available. To give you an idea, I have been processing input files that are several GiB in size (gzip-compressed Apache log files) on a single node for quite some time now. The only real limitation is how long the job takes and whether that is fast enough for you.
Upvotes: 0
Reputation: 6093
Problem solved. I had not killed the previous jobs, so this job joined the queue behind them; that's why it was delayed. I used
bin/hadoop job -kill <job_id>
to kill all the pending jobs. It then took ~140 sec
to process the whole file (~59 MB) in pseudo-distributed mode.
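As a rough sanity check, the two timings reported in this thread imply a similar throughput, which is consistent with the delay having come from the queue rather than from the file size. The sketch below uses only the approximate figures quoted above (~19 MB in ~50 sec, ~59 MB in ~140 sec); the helper name `throughput_mb_per_s` is made up for illustration.

```python
def throughput_mb_per_s(size_mb, seconds):
    """Return the average processing rate in MB per second."""
    return size_mb / seconds


# Approximate figures taken from the posts above.
sample_rate = throughput_mb_per_s(19, 50)    # 220K-line sample file
full_rate = throughput_mb_per_s(59, 140)     # full 7+ million line file

print(f"sample: {sample_rate:.2f} MB/s, full file: {full_rate:.2f} MB/s")
```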
Upvotes: 0