Reputation: 3062
I have 2 scripts, a mapper and a reducer. Both read their input from stdin through csv.reader. The mapper should take its input from a tab-delimited text file, dataset.csv, and the reducer should take the output of the mapper as its input. I want to save the output of the reducer to a text file, output.txt. What is the correct chain of commands to do this?
mapper:
#!/usr/bin/python
import sys, csv

reader = csv.reader(sys.stdin, delimiter='\t')
writer = csv.writer(sys.stdout, delimiter='\t', quotechar='"', quoting=csv.QUOTE_ALL)

for line in reader:
    if len(line) > 5:  # parse only lines in the forum_node.tsv file
        if line[5] == 'question':
            _id = line[0]
            student = line[3]  # author_id
        elif line[5] != 'node_type':
            _id = line[7]
            student = line[3]  # author_id
        else:
            continue  # ignore header
        # emit one "thread id <TAB> author id" pair per post
        print('{0}\t{1}'.format(_id, student))
reducer:
#!/usr/bin/python
import sys, csv

reader = csv.reader(sys.stdin, delimiter='\t')
writer = csv.writer(sys.stdout, delimiter='\t', quotechar='"', quoting=csv.QUOTE_ALL)

oldID = None
students = []

for line in reader:
    if len(line) != 2:
        continue
    thisID, thisStudent = line
    # a new key means the previous thread is complete, so print it
    if oldID and oldID != thisID:
        print('Thread: {0}, students: {1}'.format(oldID, ', '.join(students)))
        students = []
    oldID = thisID
    students.append(thisStudent)

# print the last thread
if oldID is not None:
    print('Thread: {0}, students: {1}'.format(oldID, ', '.join(students)))
Upvotes: 2
Views: 2168
Reputation: 1123480
Pipe the files together:
python mapper.py < dataset.csv | python reducer.py > output.txt
The < dataset.csv part gives mapper.py the CSV file on stdin, and the | pipes its stdout to another command. That other command is python reducer.py, and > output.txt connects the stdout from that script to output.txt.
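If you would rather drive the same chain from Python than from the shell, here is a minimal sketch using the standard subprocess module. The function name run_pipeline and its parameters are made up for illustration; it only mirrors what the shell does with <, | and >.

```python
import subprocess


def run_pipeline(first_cmd, second_cmd, input_path, output_path):
    """Rough equivalent of: first_cmd < input_path | second_cmd > output_path

    run_pipeline is a hypothetical helper, not part of any library.
    Both commands are given as argument lists, e.g. ['python', 'mapper.py'].
    """
    with open(input_path, 'rb') as src, open(output_path, 'wb') as dst:
        # "< input_path": the input file becomes stdin of the first command
        first = subprocess.Popen(first_cmd, stdin=src, stdout=subprocess.PIPE)
        # "|": stdout of the first command becomes stdin of the second;
        # "> output_path": stdout of the second command goes to the output file
        second = subprocess.Popen(second_cmd, stdin=first.stdout, stdout=dst)
        # close our copy of the pipe so the first command sees a broken pipe
        # if the second one exits early
        first.stdout.close()
        second.wait()
        first.wait()
    return first.returncode, second.returncode
```

Calling run_pipeline(['python', 'mapper.py'], ['python', 'reducer.py'], 'dataset.csv', 'output.txt') should then behave like the one-liner above. Note that the reducer only groups consecutive rows with the same ID, so if the mapper output is not already grouped by ID you may want a sort stage between the two commands (in the shell: python mapper.py < dataset.csv | sort | python reducer.py > output.txt).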
Upvotes: 4