Sudipto Dutta

Reputation: 25

Deduplication of records without sorting in a mainframe sequential dataset with already sorted data

This is a query on deduplicating an already sorted mainframe dataset without re-sorting it.

The input sequential dataset has the following structure. 'KEYn' in the first 4 bytes is the key, and the remainder of each row is the rest of the record's data. Some keys are repeated across several records, although the remaining data is different in each record. The records are already sorted on 'KEYn'.

KEY1aaaaaa

KEY1bbbbbb

KEY2cccccc

KEY3xxxxxx

KEY3yyyyyy

KEY3zzzzzz

KEY3wwwwww

KEY4uuuuuu

KEY5hhhhhh

KEY5ffffff

My requirement is to pick up the first record of each key and drop the remaining 'duplicates', so the output file for the above input should look like this:

KEY1aaaaaa

KEY2cccccc

KEY3xxxxxx

KEY4uuuuuu

KEY5hhhhhh

Since the data is already sorted, I don't want to use the SORT utility with SUM FIELDS=NONE or ICETOOL SELECT with the FIRST operand, since both of these would end up re-sorting the data on the deduplication key (KEYn). The actual dataset is also huge (1.6 billion records, VB, AVGRLEN 900), and a job ran out of sort work space trying to sort it in one go.

My question is: is there any option in JCL-based utilities to do this deduplication without re-sorting the data and without using sort work space? I am trying to avoid writing a COBOL/Assembler program to do this.

Upvotes: 0

Views: 306

Answers (1)

MageshJ

Reputation: 33

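Try this (untested). It assumes the VB layout from the question, so positions 1-4 hold the RDW and the key occupies positions 5-8. INREC inserts a 3-byte sequence number after the RDW that restarts at 1 whenever the key changes, and OUTFIL keeps only the records numbered 1 and strips the sequence number again. A 3-digit field can only count to 999, so if a single key can repeat more often than that, use a wider sequence number and adjust the positions to match.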

  OPTION COPY
  INREC BUILD=(1,4,SEQNUM,3,ZD,RESTART=(5,4),5)
  OUTFIL INCLUDE=(5,3,ZD,EQ,1),BUILD=(1,4,8)
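
To make the logic of these statements easier to check against the sample data, here is a minimal Python sketch of the same single-pass idea: number the records within each run of equal keys, then keep only the records numbered 1. It is only a model of the approach, not a replacement for the DFSORT step. The function name `dedup_sorted` and the `key_len` parameter are hypothetical, and the sketch ignores the 4-byte RDW, so the key is taken from the first 4 characters of each string as in the question's example.

```python
def dedup_sorted(records, key_len=4):
    """Keep the first record of each run of equal keys.

    Assumes the input is already sorted (or at least grouped) on the key,
    so one pass with no sorting and no work space is enough.
    """
    kept = []
    prev_key = None
    seq = 0
    for rec in records:
        key = rec[:key_len]
        # Models SEQNUM with RESTART on the key: the counter goes back
        # to 1 whenever the key differs from the previous record's key.
        seq = 1 if key != prev_key else seq + 1
        prev_key = key
        # Models OUTFIL INCLUDE on sequence number 1: keep only the
        # first record of each key.
        if seq == 1:
            kept.append(rec)
    return kept


# Sample data taken from the question.
sample = [
    "KEY1aaaaaa", "KEY1bbbbbb", "KEY2cccccc",
    "KEY3xxxxxx", "KEY3yyyyyy", "KEY3zzzzzz", "KEY3wwwwww",
    "KEY4uuuuuu", "KEY5hhhhhh", "KEY5ffffff",
]

for rec in dedup_sorted(sample):
    print(rec)
```

For the sample input this prints the five records listed as the expected output in the question. Because the counter only compares each key with the previous one, the approach depends on equal keys being adjacent, which holds here since the data is already sorted.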

Upvotes: 0
