Reputation: 31
I have seen many tools like Syncsort, Informatica, etc. that can efficiently convert EBCDIC mainframe files to ASCII. Since our company is small and does not want to invest in any of these tools, I have the challenge of converting EBCDIC mainframe files to ASCII myself. The upstream system is a mainframe and I am migrating the entire data set into HDFS, but since HDFS cannot handle mainframe formats directly, I have been asked to write a Spark/Java code routine to convert these mainframe EBCDIC files. I understand that when the file is exported the text portions get converted to ASCII, but packed decimal (COMP/COMP-3) fields do not. I need to write the logic to convert these partially converted mainframe EBCDIC files to ASCII so that we can do our further processing in Hadoop.
Since I am new on this site and cannot attach my sample EBCDIC file, please consider the content below as a sample. It contains ASCII as well as junk characters; the junk appears after the salary field, in the Dept field, which has a COMP data type. Below is the emp.txt file:
101GANESH 10000á?
102RAMESH 20000€
103NAGESH 40000€
Below is the emp copybook:
01 EMPLOYEE-DETAILS.
05 EMP-ID PIC 9(03).
05 EMP-NAME PIC X(10).
05 EMP-SAL PIC 9(05).
05 DEPT PIC 9(3) COMP-3.
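From what I have read, a COMP-3 (packed decimal) field stores two digits per byte and uses the last nibble as the sign, so my DEPT field PIC 9(3) COMP-3 takes 2 bytes. This is the kind of hand-written decoding I was thinking of (an untested sketch only; the offsets come from the copybook above, the file name is just an example, and I am assuming the display fields are already in ASCII as described):

import java.nio.file.{Files, Paths}

object Comp3Sketch {

  // Decode a packed-decimal (COMP-3) byte slice into a Long.
  // Each byte holds two 4-bit digits; the final nibble is the sign
  // (0xD = negative, anything else treated as positive).
  def decodeComp3(bytes: Array[Byte]): Long = {
    val nibbles = bytes.flatMap(b => Array((b >> 4) & 0x0F, b & 0x0F))
    val digits  = nibbles.dropRight(1)
    val sign    = if (nibbles.last == 0x0D) -1L else 1L
    sign * digits.foldLeft(0L)((acc, d) => acc * 10 + d)
  }

  def main(args: Array[String]): Unit = {
    // One 20-byte record: 3 + 10 + 5 display bytes, then 2 packed bytes for DEPT.
    val record = Files.readAllBytes(Paths.get("emp.bin")).take(20)

    val empId   = new String(record.slice(0, 3),  "US-ASCII")  // EMP-ID  PIC 9(03)
    val empName = new String(record.slice(3, 13), "US-ASCII")  // EMP-NAME PIC X(10)
    val empSal  = new String(record.slice(13, 18), "US-ASCII") // EMP-SAL PIC 9(05)
    val dept    = decodeComp3(record.slice(18, 20))            // DEPT PIC 9(3) COMP-3

    println(s"$empId,$empName,$empSal,$dept")
  }
}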
Upvotes: 3
Views: 3928
Reputation: 1333
You can use Cobrix, which is a COBOL data source for Spark. It is open-source.
You can use Spark to load the files, parse the records and store them in any format you want, including plain text, which seems to be what you are looking for.
DISCLAIMER: I work for ABSA and I am one of the developers behind this library. Our focus is on 1) ease of use, 2) performance.
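As a rough example (paths are placeholders and the option names should be checked against the README), loading such a file and writing it back out as plain text could look like this:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("CobrixToText")
  .getOrCreate()

// Read the EBCDIC file with the "cobol" data source provided by Cobrix,
// pointing it at the copybook that describes the record layout.
val df = spark.read
  .format("cobol")
  .option("copybook", "hdfs:///copybooks/emp_copybook.cpy")
  .load("hdfs:///data/emp.dat")

// From here it is an ordinary DataFrame, so it can be written out as
// plain CSV/text for downstream Hadoop processing.
df.write
  .option("header", "true")
  .csv("hdfs:///output/emp_csv")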
Upvotes: 1
Reputation: 10543
There is also this option (it also uses JRecord):
It is based on CopybookHadoop, which looks to be a clone of the CopybookInputFormat that Thiago mentioned.
Anyway, from the documentation:
This example reads data from a local binary file "file:///home/cdap/DTAR020_FB.bin" and parses it using the schema given in the text area "COBOL Copybook". It will drop the field "DTAR020-DATE" and generate structured records with the schema specified in the text area.
{
    "name": "CopybookReader",
    "plugin": {
        "name": "CopybookReader",
        "type": "batchsource",
        "properties": {
            "drop": "DTAR020-DATE",
            "referenceName": "Copybook",
            "copybookContents":
                "000100* \n
                 000200*   DTAR020 IS THE OUTPUT FROM DTAB020 FROM THE IML \n
                 000300*   CENTRAL REPORTING SYSTEM \n
                 000400* \n
                 000500*   CREATED BY BRUCE ARTHUR 19/12/90 \n
                 000600* \n
                 000700*   RECORD LENGTH IS 27. \n
                 000800* \n
                 000900     03 DTAR020-KCODE-STORE-KEY. \n
                 001000        05 DTAR020-KEYCODE-NO   PIC X(08). \n
                 001100        05 DTAR020-STORE-NO     PIC S9(03)   COMP-3. \n
                 001200     03 DTAR020-DATE            PIC S9(07)   COMP-3. \n
                 001300     03 DTAR020-DEPT-NO         PIC S9(03)   COMP-3. \n
                 001400     03 DTAR020-QTY-SOLD        PIC S9(9)    COMP-3. \n
                 001500     03 DTAR020-SALE-PRICE      PIC S9(9)V99 COMP-3. ",
            "binaryFilePath": "file:///home/cdap/DTAR020_FB.bin",
            "maxSplitSize": "5"
        }
    }
}
Upvotes: 1
Reputation: 7742
There is a Java library called JRecord that you can use with Spark to convert EBCDIC binary files to ASCII.
You can find the code from this developer here.
It can be integrated with Scala through the newAPIHadoopFile function so that it runs in Spark. The code is written for Hadoop, but it will work fine with Spark.
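A rough sketch of what that integration could look like follows; the CopybookInputFormat package name, the key/value types, and the way the copybook location is passed in are assumptions that depend on the version of the library you build, so treat them as placeholders and check its README.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.spark.{SparkConf, SparkContext}

// Assumed package for the JRecord-based input format; adjust it to the
// version of CopybookInputFormat/CopybookHadoop you actually build.
import com.cloudera.sa.copybook.mapreduce.CopybookInputFormat

object CopybookSparkSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("CopybookSketch"))

    // The copybook location is passed through the Hadoop Configuration;
    // see the library's README for the exact property or helper method.
    val conf = new Configuration()

    val records = sc.newAPIHadoopFile(
      "hdfs:///data/emp.bin",        // EBCDIC input file (placeholder path)
      classOf[CopybookInputFormat],  // JRecord-based InputFormat (assumed class)
      classOf[LongWritable],         // assumed key type
      classOf[Text],                 // assumed value type (parsed record)
      conf)

    // Each value is a parsed record; write it back out as plain text.
    records.map { case (_, value) => value.toString }
      .saveAsTextFile("hdfs:///output/emp_ascii")
  }
}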
Upvotes: 1