Reputation: 4967
I'm trying to convert an HDFS file from UTF-8 to ISO-8859-1.
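I've written a small Java program:
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.Path;

String theInputFileName = "my-utf8-input-file.csv";
String theOutputFileName = "my-iso8859-output-file.csv";
Charset inputCharset = StandardCharsets.UTF_8;
Charset outputCharset = StandardCharsets.ISO_8859_1;
try (
    final FSDataInputStream in = theFileSystem.open(new Path(theInputFileName));
    final FSDataOutputStream out = theFileSystem.create(new Path(theOutputFileName))
)
{
    // Decode the input as UTF-8, then re-encode each line as ISO-8859-1.
    try (final BufferedReader reader = new BufferedReader(new InputStreamReader(in, inputCharset)))
    {
        String line;
        while ((line = reader.readLine()) != null)
        {
            out.write(line.getBytes(outputCharset));
            // lineSeparator is a field of the enclosing class (not shown here).
            out.write(this.lineSeparator.getBytes(outputCharset));
        }
    }
} catch (IllegalArgumentException | IOException e)
{
    RddFileWriter.LOGGER.error(e, "Exception on file '%s'", theOutputFileName);
}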
This code is executed on a Hadoop cluster using Spark (the output data is usually provided by an RDD).
To simplify my issue, I have removed the RDD/Dataset parts to work directly on the HDFS file.
When I execute the code, the encoding of the output file varies:
ISO-8859-1
ISO-8859-1
UTF-8 instead of ISO-8859-1
I don't understand which properties (or something else) may be causing this change in behavior.
Versions :
Looking forward to your help. Thanks in advance
Upvotes: 3
Views: 1190
Reputation: 4967
Finally, I found the source of my problem.
The input file on the cluster was corrupted: it did not have a single, consistent encoding throughout.
External data are aggregated daily, and the encoding recently changed from ISO to UTF-8 without notification...
To put it more simply:
We split the data, fixed the encoding of each part, and merged the parts back together to repair the input.
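For readers facing a similar mixed-encoding file, here is a minimal sketch of one way to normalize it. It is not the procedure actually used for this repair: the class `MixedEncodingRepair` and its methods are hypothetical names introduced for illustration. The idea is to decode each raw line strictly as UTF-8 and fall back to ISO-8859-1 when the bytes are not well-formed UTF-8. Splitting the raw bytes on `0x0A` beforehand is safe for both encodings, because that byte never occurs inside a multi-byte UTF-8 sequence.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the original post.
final class MixedEncodingRepair {

    /** Returns true if the bytes form a well-formed UTF-8 sequence. */
    static boolean isValidUtf8(byte[] bytes) {
        // REPORT makes the decoder throw instead of silently substituting U+FFFD.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    /** Decodes one raw line: strict UTF-8 first, ISO-8859-1 as the fallback. */
    static String decodeLine(byte[] rawLine) {
        return isValidUtf8(rawLine)
                ? new String(rawLine, StandardCharsets.UTF_8)
                : new String(rawLine, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] utf8Line = "caf\u00e9".getBytes(StandardCharsets.UTF_8);      // 63 61 66 C3 A9
        byte[] isoLine = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);  // 63 61 66 E9
        // Both lines decode to the same text despite their different encodings.
        System.out.println(decodeLine(utf8Line).equals(decodeLine(isoLine))); // prints "true"
    }
}
```

This heuristic has a known limit: an ISO-8859-1 line whose bytes happen to form valid UTF-8 (for example the two characters "Ã©") is decoded as UTF-8. Pure ASCII lines are unaffected, since both encodings represent them identically.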
The final code works fine.
private void changeEncoding(
        final Path thePathInputFileName, final Path thePathOutputFileName,
        final Charset theInputCharset, final Charset theOutputCharset,
        final String theLineSeparator
) {
    // Resources are closed in reverse order, so the writer is flushed before the HDFS stream closes.
    try (
        final FSDataInputStream in = this.fileSystem.open(thePathInputFileName);
        final FSDataOutputStream out = this.fileSystem.create(thePathOutputFileName);
        final BufferedReader reader = new BufferedReader(new InputStreamReader(in, theInputCharset));
        final BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(out, theOutputCharset))
    ) {
        String line;
        while ((line = reader.readLine()) != null) {
            writer.write(line);
            writer.write(theLineSeparator);
        }
    } catch (IllegalArgumentException | IOException e) {
        LOGGER.error(e, "Exception on file '%s'", thePathOutputFileName);
    }
}
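To check the same reader/writer pattern without a cluster, the sketch below runs it on in-memory byte arrays using only `java.io`. The class `TranscodeSketch` and its `transcode` method are hypothetical names for illustration, and the Hadoop streams are swapped for `ByteArrayInputStream` and `ByteArrayOutputStream`.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for changeEncoding, working on byte arrays instead of HDFS paths.
final class TranscodeSketch {

    static byte[] transcode(byte[] input, Charset from, Charset to, String lineSeparator) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (BufferedReader reader = new BufferedReader(
                     new InputStreamReader(new ByteArrayInputStream(input), from));
             BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(sink, to))) {
            String line;
            while ((line = reader.readLine()) != null) {
                writer.write(line);
                writer.write(lineSeparator);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // The writer is closed (and therefore flushed) before the bytes are read back.
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        byte[] utf8 = "caf\u00e9\n".getBytes(StandardCharsets.UTF_8);
        byte[] latin1 = transcode(utf8, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1, "\n");
        // 'é' takes two bytes in UTF-8 and one byte in ISO-8859-1.
        System.out.println(utf8.length + " -> " + latin1.length); // prints "6 -> 5"
    }
}
```

One thing to keep in mind with this pattern: `OutputStreamWriter` does not fail on characters that ISO-8859-1 cannot represent (such as `€`). It writes `?` in their place, so data loss of that kind goes unreported.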
Stop your research ! ;-)
Upvotes: 1