Reputation: 131
I have a requirement to read a Hive table and write it out in text format with EBCDIC encoding, since the output will be used as input to a mainframe process. We are using the cp037 encoding. To do that I read the Hive table into a DataFrame, call a custom UDF to convert each numeric value into COMP-3 format (the packed-decimal format used to store numeric values on the mainframe), and then decode those bytes using cp037 for the mainframe EBCDIC charset. The problem is that while writing to a text file, or displaying the values, some extra junk characters are getting added.
First I made sure that the custom UDF is working fine. To verify it, I encoded the output of the custom UDF back using cp037, and the output below is exactly what I expect. Here newcol1 is the mainframe COMP-3 equivalent of field1 and newcol2 is the COMP-3 equivalent of field2.
root
|-- field1: long (nullable = true)
|-- newcol1: binary (nullable = true)
|-- field2: long (nullable = true)
|-- newcol2: binary (nullable = true)
+---------------+----------------------+---------------------------+----------------+
|field1 |newcol1 |field2 |newcol2 |
+---------------+----------------------+---------------------------+----------------+
|2023097000 |[00 02 02 30 97 00 0C]|320264 |[00 03 20 26 4F]|
|2023097000 |[00 02 02 30 97 00 0C]|343012 |[00 03 43 01 2F]|
|2023100000 |[00 02 02 31 00 00 0C]|343012 |[00 03 43 01 2F]|
|2023100000 |[00 02 02 31 00 00 0C]|320264 |[00 03 20 26 4F]|
+---------------+----------------------+---------------------------+----------------+
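For reference, here is a minimal sketch of the kind of packing the custom UDF performs (the helper name pack_comp3 and the digit counts are my own illustration, not the actual UDF):

from pyspark.sql.functions import udf
from pyspark.sql.types import BinaryType

def pack_comp3(value, num_digits, positive_sign=0xC):
    """Pack an integer into COMP-3 (packed decimal): one decimal digit per nibble,
    with the sign in the low nibble of the last byte (C/F = positive, D = negative)."""
    sign = 0xD if value < 0 else positive_sign
    digits = str(abs(value)).rjust(num_digits, "0")
    if len(digits) % 2 == 0:           # digit count must be odd so digits + sign fill whole bytes
        digits = "0" + digits
    nibbles = [int(d) for d in digits] + [sign]
    return bytes((hi << 4) | lo for hi, lo in zip(nibbles[0::2], nibbles[1::2]))

# pack_comp3(320264, 9, 0xF).hex() -> '000320264f', matching newcol2 above
comp3_13 = udf(lambda v: pack_comp3(v, 13), BinaryType())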
But when I try to write/display the same data after decoding, junk characters appear:
root
|-- field1: long (nullable = true)
|-- newcol11: string (nullable = true)
|-- field2: long (nullable = true)
|-- newcol21: string (nullable = true)
+---------------+--------+---------------------------+--------+
|field1 |newcol11|field2 |newcol21|
+---------------+--------+---------------------------+--------+
|2023097000 |^@^B^B~Pp^@^L |320264 |^@^C~@^W| |
|2023097000 |^@^B^B~Pp^@^L |343012 |^@^C?^A^G |
|2023100000 |^@^B^B~Q^@^@^L |343012 |^@^C?^A^G |
|2023100000 |^@^B^B~Q^@^@^L |320264 |^@^C~@^W| |
+---------------+--------+---------------------------+--------+
The mainframe file looks like the below (note that I have set HEX ON to show the actual nibbles):
********************************
-------------------------------
2023097000...Ì.p..320264..Ì..|
FFFFFFFFFF00073900FFFFFF00722444
20230970000228070C32026403806F00
-------------------------------
If you notice, the byte X'78' is the junk character that is getting added.
The PySpark file-write command I am using is df.coalesce(1).write.format('text').option("encoding","cp037").mode('overwrite').save('/some_location/test/comp3_outputdata/')
It seems that writing in text format does not support the encoding option.
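To illustrate what I think is happening (a small check outside Spark, using the same bytes as in newcol1 above):

raw = bytes([0x00, 0x02, 0x02, 0x30, 0x97, 0x00, 0x0C])  # COMP-3 bytes for 2023097000
decoded = raw.decode("cp037")                             # 7 characters after decoding
print(len(decoded.encode("cp037")))                       # 7   -> re-encoding with cp037 round-trips cleanly
print(len(decoded.encode("utf-8")))                       # > 7 -> the character decoded from X'30' is non-ASCII,
                                                          #        so a UTF-8 write emits extra bytes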
According to the link below, it seems PySpark doesn't support an encoding option when calling the text method:
Pass encoding option in Pyspark text method
I can't use any other format, as my output file will be picked up directly by a mainframe process. Can somebody please help me with this problem?
Upvotes: 1
Views: 976
Reputation: 131
In solving this particular problem we encountered multiple issues, so I am writing up the detailed process we followed to resolve them, for future reference:
https://www.ibm.com/docs/en/cpl?topic=zos-mbcs-conversion-during-microsoft-windows-copy
Problem 2 -> In code page cp037, X'15' and X'25' are interchangeable, both indicating NewLine. The issue we faced is that if a hex byte of X'15' appeared as part of a COMP-3 field, that value was getting changed to X'25'. We did a lot of research and there seems to be no straightforward way around it; it is effectively a quirk of this particular code page. We discussed it with IBM and they suggested we use code page IBM-1047. Note that @Hogstrom mentioned this already in his response, but somehow I didn't notice it. The file itself was still written as UTF-8; only the destination encoding changed from cp037 to cp1047.
Problem 3 -> This time we noticed that extra white space was being added to some lines. There was no definite logic to it; for example, no extra space was added for 2.7, but for 2.5 an extra character appeared. This caused a layout issue for us even though all the data issues were resolved. To solve it, we forcefully re-encoded the entire PySpark output file from UTF-8 to cp1047. Since I was writing in text mode in PySpark, and encoding is not supported in that API, we wrote a separate Python script altogether for that step (a sketch of it is shown after Problem 4 below).
Problem 4 -> At this point we noticed an extra character, X'15', being added at the end of each record. This was expected, since the PySpark text write API uses \n as the default line separator. To resolve it we used a custom line separator, and in the follow-on Python script used rstrip to remove that custom separator. This fixed the issue.
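A minimal sketch of that follow-on step, putting Problems 3 and 4 together (the paths, separator string and part-file name are placeholders, and it assumes a Spark version whose text writer supports the lineSep option; cp1047 is not in the Python standard codec set, so this also assumes the third-party ebcdic package, which registers that codec on import):

# Spark side: write with a custom line separator so no bare \n lands in the data
# df.coalesce(1).write.format('text').option('lineSep', '@@').mode('overwrite') \
#   .save('/some_location/test/comp3_outputdata/')

import ebcdic  # third-party package; registers a cp1047 codec (assumption)

src = "/some_location/test/comp3_outputdata/part-00000"    # placeholder part-file name
dst = "/some_location/test/comp3_outputdata_ebcdic/data"   # placeholder target path

with open(src, "r", encoding="utf-8") as fin, open(dst, "wb") as fout:
    for rec in fin.read().split("@@"):                     # split on the custom separator
        rec = rec.rstrip("\n")                             # drop any stray trailing newline
        if rec:
            fout.write(rec.encode("cp1047"))               # re-encode each record to EBCDIC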
Problem 5 -> Finally, with the MBCS SFT profile setting, the output file is always transferred in VB mode, while in our case we need to support both VB and FB. So we removed the MBCS SFT profile and used a simple SFT profile with mode=binary. That satisfied the requirement and produced a perfect file with a perfect layout.
I hope the above steps help anybody facing any type of issue while working on similar requirements.
Upvotes: 1
Reputation: 3761
COMP-3 is a term used in COBOL to refer to the packed decimal format on IBM Z hardware. Other formats include COMP, which is binary; think of that as an int32.
In the decimal format there are several decimal instructions that can process the number for various arithmetic operations. It's not clear exactly what you want to do in your question, but since you refer to COMP-3, it's a reasonable assumption that you want a COBOL program to process your numeric data as packed decimal and your textual information using the CP-037 code page. As a side note, CP-037 is generally used in European systems and CP-1047 in North American systems.
The binary structure of packed decimal is unrelated to the Code Page.
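To make that concrete, here is a small sketch (my own illustration, not the asker's code) that unpacks one of the values from the question without referencing any code page:

def unpack_comp3(raw: bytes) -> int:
    # Split each byte into two nibbles; the last nibble is the sign (C/F positive, D negative)
    nibbles = []
    for b in raw:
        nibbles.extend([b >> 4, b & 0x0F])
    sign = nibbles.pop()
    value = int("".join(str(d) for d in nibbles))
    return -value if sign == 0xD else value

print(unpack_comp3(bytes.fromhex("000320264F")))  # -> 320264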
In your case, you'll need to process each column based on its data type. Based on your example you'll have something like this:
field1
- str
newcol1
- comp-3
field2
- str
newcol2
- comp-3
Rather than converting the entire record using CP-037 encoding (which is only valid for field1 and field2), you need to apply the conversion to field1 and field2 only. newcol1 and newcol2 should be left untouched and copied to the new output record as-is. Since they are binary, encoding them will result in funky translations like the X'78' you are seeing.
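As a rough sketch of what building such an output record could look like (a hypothetical helper, assuming the record layout from the question):

def build_record(field1, comp3_1, field2, comp3_2):
    rec = bytearray()
    rec += str(field1).encode("cp037")  # textual fields: character conversion is appropriate
    rec += comp3_1                      # packed-decimal fields: copy the bytes untouched
    rec += str(field2).encode("cp037")
    rec += comp3_2
    return bytes(rec)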
Upvotes: 1