Maggie

Reputation: 6093

Maximum file size that can be processed using Hadoop in 'pseudo distributed' mode

I am processing a file with 7+ million lines (~59 MB) on an Ubuntu 11.04 machine with this configuration:

Intel(R) Core(TM)2 Duo CPU     E8135  @ 2.66GHz, 2280 MHz
Memory: 2GB
Disk: 100GB

Even after running for 45 minutes, I saw no progress.

Deleted hdfs://localhost:9000/user/hadoop_admin/output
packageJobJar: [/home/hadoop_admin/Documents/NLP/Dictionary/dict/drugs.csv, /usr/local/hadoop/mapper.py, /usr/local/hadoop/reducer.py, /tmp/hadoop-hadoop_admin/hadoop-unjar8773176795802479000/] [] /tmp/streamjob582836411271840475.jar tmpDir=null
11/07/22 10:39:20 INFO mapred.FileInputFormat: Total input paths to process : 1
11/07/22 10:39:21 INFO streaming.StreamJob: getLocalDirs(): [/tmp/hadoop-hadoop_admin/mapred/local]
11/07/22 10:39:21 INFO streaming.StreamJob: Running job: job_201107181559_0099
11/07/22 10:39:21 INFO streaming.StreamJob: To kill this job, run:
11/07/22 10:39:21 INFO streaming.StreamJob: /usr/local/hadoop/bin/../bin/hadoop job  -Dmapred.job.tracker=localhost:9001 -kill job_201107181559_0099
11/07/22 10:39:21 INFO streaming.StreamJob: Tracking URL: http://localhost:50030/jobdetails.jsp?jobid=job_201107181559_0099
11/07/22 10:39:22 INFO streaming.StreamJob:  map 0%  reduce 0%

What is the maximum possible file size that can be processed using Hadoop in pseudo-distributed mode?

Update:

I am doing a simple wordcount application using Hadoop Streaming. My mapper.py and reducer.py take around 50 seconds to process a file with 220K lines (~19 MB).
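
For context, a minimal streaming wordcount in the same shape as my scripts looks roughly like the sketch below (a sketch, not my exact code; it assumes whitespace-separated words and the tab-separated key/value format that Hadoop Streaming uses by default). The mapper:

    #!/usr/bin/env python
    # mapper.py - emit "word<TAB>1" for every word read from stdin
    import sys

    for line in sys.stdin:
        for word in line.strip().split():
            sys.stdout.write('%s\t1\n' % word)

And the reducer, which relies on the framework sorting mapper output by key, so identical words arrive on consecutive lines:

    #!/usr/bin/env python
    # reducer.py - sum the counts for each word; input is the
    # mapper output, sorted by key by the streaming framework
    import sys

    current_word = None
    current_count = 0

    for line in sys.stdin:
        word, count = line.rstrip('\n').split('\t', 1)
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                sys.stdout.write('%s\t%d\n' % (current_word, current_count))
            current_word = word
            current_count = int(count)

    # flush the last word
    if current_word is not None:
        sys.stdout.write('%s\t%d\n' % (current_word, current_count))

Both scripts just read stdin and write stdout, which is all the streaming jar requires of them.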

Upvotes: 0

Views: 560

Answers (2)

Niels Basjes

Reputation: 10642

The size limit is really dictated by the amount of storage you have available. To give you an idea, I have been processing input files that are several GiB in size (gzip-compressed Apache logfiles) on a single node for quite some time now. The only real limitation is how long the processing takes and whether that is fast enough for you.

Upvotes: 0

Maggie

Reputation: 6093

Problem solved: I hadn't killed the previous jobs, so this one sat in the queue behind them, which is why it appeared stalled. I used bin/hadoop job -kill <job_id> (the kill command printed in the job output above) to kill all the pending jobs. The whole file (~59 MB) then took ~140 seconds to process in pseudo-distributed mode.

Upvotes: 0
