MystikDan

Reputation: 1

How to read files that use unsupported encodings and/or charsets in Java

I need to read a CSV file into a Java application, but the file is encoded using Western (Mac OS Roman), which is unsupported in Java.
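Whether Mac OS Roman is really unavailable can depend on the JRE in use, so it may be worth checking before working around it. The sketch below assumes the charset name "x-MacRoman", which is the name some JDK builds use for this encoding among their extended charsets. The class and method names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;

class CharsetCheck {

    // Returns true if this JVM can decode the named charset.
    // An illegal charset name is treated as "not available".
    static boolean isAvailable(String name) {
        try {
            return Charset.isSupported(name);
        } catch (IllegalCharsetNameException e) {
            return false;
        }
    }
}
```

If CharsetCheck.isAvailable("x-MacRoman") returns true on the target JVM, the file could be decoded directly by passing Charset.forName("x-MacRoman") to the InputStreamReader. If it returns false, a workaround is still needed.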

It's been suggested that I read the file as a byte stream and convert every byte above 127 (that is, anything outside ASCII) to the space character (ASCII 32). But I have no idea how to do this. I don't know how to process one byte at a time, how to convert the bytes, or how, once I've reached the end of a line, to take that line of "truncated" text, split it into an array, and pull the data out of the indexes I need.
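The byte-stream approach described above could look roughly like the following. This is a minimal sketch, not a full CSV parser: it assumes fields contain no quoted commas, and the class and method names (AsciiLineReader, readRows) are hypothetical. It treats both \r and \n as line terminators, since older Mac files often end lines with \r alone.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class AsciiLineReader {

    // Reads the stream one byte at a time, replaces every non-ASCII byte
    // (value 128 or higher) with a space, and splits each line on commas.
    static List<String[]> readRows(InputStream in) throws IOException {
        InputStream buffered = new BufferedInputStream(in);
        List<String[]> rows = new ArrayList<>();
        ByteArrayOutputStream line = new ByteArrayOutputStream();
        int b;
        // read() returns a value in 0..255, or -1 at end of stream
        while ((b = buffered.read()) != -1) {
            if (b == '\n' || b == '\r') {
                flushLine(line, rows);
            } else {
                line.write(b >= 128 ? ' ' : b);
            }
        }
        flushLine(line, rows); // last line may lack a terminator
        return rows;
    }

    // Turns the collected bytes into a row; empty lines are skipped,
    // which also collapses a \r\n pair into a single line break.
    private static void flushLine(ByteArrayOutputStream line, List<String[]> rows) {
        if (line.size() == 0) {
            return;
        }
        String text = new String(line.toByteArray(), StandardCharsets.US_ASCII);
        rows.add(text.split(",", -1)); // naive split, assumes no quoted commas
        line.reset();
    }
}
```

Each returned String[] can then be indexed the same way as the rows in the loop below, for example row[0] and row[1].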

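import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;
import java.util.regex.Pattern;
import org.apache.commons.io.IOUtils;
// CSVReader is assumed to be opencsv's; its import depends on the opencsv version

SortedMap<String, OBJ_NAME> mapResults = new TreeMap<String, OBJ_NAME>();
String url = "url-to-file";
InputStream inputStream = null;
InputStreamReader reader = null;
CSVReader csvReader = null;
// matches two digits, a dot, then one to three digits (not used below yet)
final Pattern regexPattern = Pattern.compile("^\\d{2}\\.\\d{1,3}$");

try {
    inputStream = new URL(url).openStream();

    // decodes as UTF-8, which is the wrong charset for this file
    reader = new InputStreamReader(inputStream, StandardCharsets.UTF_8);
    // separator ',', quote character '"', skip the first line
    csvReader = new CSVReader(reader, ',', '"', 1);
    List<String[]> lines = csvReader.readAll();

    for (String[] line : lines) {
        // logic to grab data from first and second indices of the line
        OBJ_NAME objInstance = new OBJ_NAME();

        objInstance.setFieldOne(line[0]);
        objInstance.setFieldTwo(line[1]);
        mapResults.put(line[1], objInstance);
    }
} catch (Exception e) {
    throw new IOException(e);
} finally {
    // IOUtils from apache commons
    IOUtils.closeQuietly(inputStream);
    IOUtils.closeQuietly(reader);
    IOUtils.closeQuietly(csvReader);
}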

Because the CSV uses an unsupported encoding rather than UTF-8, the logic above decodes the data incorrectly, and I'm getting far fewer results than I should. I'm not sure whether I should read it as ASCII and "intercept" the characters above 127 (which I don't know how to do), or read it as a byte stream instead (which I also don't know how to do).
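For the "read it as ASCII" option, one possibility is to let a CharsetDecoder do the substitution. The sketch below is a hedged illustration with hypothetical names (LenientAsciiReader, open): it configures a US-ASCII decoder to replace any byte it cannot decode with a space, so the result can be handed to a CSV reader like any other Reader.

```java
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class LenientAsciiReader {

    // Wraps the stream in a Reader that decodes US-ASCII and substitutes
    // a space for every byte outside the ASCII range (128 or higher).
    static Reader open(InputStream in) {
        CharsetDecoder decoder = StandardCharsets.US_ASCII.newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE)
                .replaceWith(" ");
        return new InputStreamReader(in, decoder);
    }
}
```

In the code above, this would mean passing LenientAsciiReader.open(inputStream) to the CSVReader constructor in place of the UTF-8 InputStreamReader. The accented characters are lost either way, so this only helps if the needed columns are plain ASCII.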

Help? And also, screw anyone who releases documents with official information in outdated, unsupported encodings.

Upvotes: 0

Views: 214

Answers (0)
