Reputation: 15927

Java read file got a leading BOM [ ï»¿ ]

I am reading a file of keywords line by line and ran into a strange problem. Consecutive lines with the same content should be handled only once. For example, given

sony
sony

only the first one should be processed. The problem is that Java doesn't treat them as equal:

INFO: [, s, o, n, y]
INFO: [s, o, n, y]

My code looks like the following, where's the problem?

    FileReader fileReader = new FileReader("some_file.txt");
    BufferedReader bufferedReader = new BufferedReader(fileReader);
    String prevLine = "";
    String strLine;
    while ((strLine = bufferedReader.readLine()) != null) {
        logger.info(Arrays.toString(strLine.toCharArray()));
        if(strLine.contentEquals(prevLine)){
            logger.info("Skipping the duplicate lines " + strLine);
            continue;
        }
        prevLine = strLine;
    }

Update:

It looks like there's a leading space on the first line, but there actually isn't, and the trim approach doesn't work for me. They're not the same:

INFO: [, s, o, n, y]
INFO: [ , s, o, n, y]

I don't know what the first character added by Java is.

Solved: the problem was solved with BalusC's solution. Thanks for pointing out that it's a BOM problem, which helped me find the solution quickly.

Upvotes: 6

Answers (7)

Zack Walton

Reputation: 196

Open the file in a text editor, navigate to File > Save As... and choose UTF-8 encoding, instead of UTF-8 with BOM.

Upvotes: 0

Yash

Reputation: 9588

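The byte order mark (BOM) is the Unicode character U+FEFF. Its use is optional and, if used, it should appear at the start of the text stream. When a UTF-8 file that starts with a BOM is decoded with a single-byte charset such as ISO-8859-1 or Windows-1252, the three BOM bytes EF BB BF show up as ï»¿ at the start of the text.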

Microsoft compilers and interpreters, and many pieces of software on Microsoft Windows such as Notepad treat the BOM as a required magic number rather than use heuristics. These tools add a BOM when saving text as UTF-8, and cannot interpret UTF-8 unless the BOM is present or the file contains only ASCII. Google Docs also adds a BOM when converting a document to a plain text file for download.

File file = new File( csvFilename );
FileInputStream inputStream = new FileInputStream(file);
// [{"Key2":"21","ï»¿Key1":"11","Key3":"31"} ]
InputStreamReader inputStreamReader = new InputStreamReader( inputStream, "UTF-8" );

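This can be partly resolved by explicitly passing UTF-8 as the charset to InputStreamReader. In UTF-8, the byte sequence EF BB BF (displayed as ï»¿ when misdecoded) decodes to a single character, U+FEFF. Java's UTF-8 decoder does not strip that character, so it still has to be removed from the start of the text.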
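A minimal sketch of that idea, assuming Java 7 or later. The class name `BomAwareLines` and its method are hypothetical, made up for illustration: the stream is decoded as UTF-8, and a leading U+FEFF is dropped from the first line only.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, for illustration only.
public class BomAwareLines {

    /** Reads all lines as UTF-8 and drops a leading U+FEFF from the first line, if present. */
    public static List<String> readLines(InputStream in) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            boolean first = true;
            while ((line = reader.readLine()) != null) {
                // The BOM can only occur at the very start of the stream.
                if (first && line.startsWith("\uFEFF")) {
                    line = line.substring(1);
                }
                first = false;
                lines.add(line);
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // EF BB BF is the UTF-8 BOM, followed by "sony\nsony\n".
        byte[] data = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF,
                       's', 'o', 'n', 'y', '\n', 's', 'o', 'n', 'y', '\n'};
        List<String> lines = readLines(new ByteArrayInputStream(data));
        System.out.println(lines.get(0).equals(lines.get(1))); // prints true
    }
}
```

For a real file, the same method could be called with `new FileInputStream("some_file.txt")` in place of the in-memory stream.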

Using Google Guava's CharMatcher, you can remove any invisible characters and then retain only ASCII characters (which drops accented and other non-ASCII characters entirely) like this:

String printable = CharMatcher.INVISIBLE.removeFrom( input );
String clean = CharMatcher.ASCII.retainFrom( printable );

Full Example to read data from the CSV file to JSON Object:

public class CSV_FileOperations {
    static List<HashMap<String, String>> listObjects = new ArrayList<HashMap<String,String>>();
    protected static List<JSONObject> jsonArray = new ArrayList<JSONObject >();

    public static void main(String[] args) {
        String csvFilename = "D:/Yashwanth/json2Bson.csv";

        csvToJSONString(csvFilename);
        String jsonData = jsonArray.toString();
        System.out.println("File JSON Data : \n"+ jsonData);
    }

    @SuppressWarnings("deprecation")
    public static String csvToJSONString( String csvFilename ) {
        try {
            File file = new File( csvFilename );
            FileInputStream inputStream = new FileInputStream(file);

            String fileExtensionName = csvFilename.substring(csvFilename.indexOf(".")); // fileName.split(".")[1];
            System.out.println("File Extension : "+ fileExtensionName);

            // [{"Key2":"21","ï»¿Key1":"11","Key3":"31"} ]
            InputStreamReader inputStreamReader = new InputStreamReader( inputStream, "UTF-8" );

            BufferedReader buffer = new BufferedReader( inputStreamReader );
            Stream<String> readLines = buffer.lines();
            boolean headerStream = true;

            List<String> headers = new ArrayList<String>();
            for (String line : (Iterable<String>) () -> readLines.iterator()) {
                String[] columns = line.split(",");
                if (headerStream) {
                    System.out.println(" ===== Headers =====");

                    for (String keys : columns) {
                        // ï»¿ - UTF-8 - ? « https://stackoverflow.com/a/11021401/5081877
                        String printable = CharMatcher.INVISIBLE.removeFrom( keys );
                        String clean = CharMatcher.ASCII.retainFrom(printable);
                        // replaceAll takes a regex; replace() would look for the literal text "\P{Print}".
                        String key = clean.replaceAll("\\P{Print}", "");
                        headers.add( key );
                    }
                    headerStream = false;
                    System.out.println(" ===== ----- Data ----- =====");
                } else {
                    addCSVData(headers, columns );
                }
            }
            inputStreamReader.close();
            buffer.close();


        } catch (FileNotFoundException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
        return null;
    }
    @SuppressWarnings("unchecked")
    public static void addCSVData( List<String> headers, String[] row ) {
        if( headers.size() == row.length ) {
            HashMap<String,String> mapObj = new HashMap<String,String>();
            JSONObject jsonObj = new JSONObject();
            for (int i = 0; i < row.length; i++) {
                mapObj.put(headers.get(i), row[i]);
                jsonObj.put(headers.get(i), row[i]);
            }
            jsonArray.add(jsonObj);
            listObjects.add(mapObj);
        } else {
            System.out.println("Avoiding the Row Data...");
        }
    }
}

json2Bson.csv File data.

Key1,Key2,Key3
11,21,31
12,22,32
13,23,33

Upvotes: 2

deltaforce2

Reputation: 593

I had a similar case in my previous project. The culprit was the Byte order mark, which I had to get rid of. Eventually I implemented a hack based on this example. Check it out, might be that you have the same problem.
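The linked example is not reproduced here. One common way to write such a workaround, sketched below on the assumption that the file is UTF-8, is to peek at the first three bytes with a `PushbackInputStream` and push them back if they are not `EF BB BF`. The class and method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;

// Hypothetical utility, not the code from the linked example.
public class Utf8BomSkipper {

    /** Returns a stream positioned after a UTF-8 BOM (EF BB BF) if the input starts with one. */
    public static InputStream skipBom(InputStream in) throws IOException {
        PushbackInputStream pushback = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];
        int read = 0;
        // A single read() may return fewer bytes than requested, so loop until 3 bytes or EOF.
        while (read < 3) {
            int n = pushback.read(head, read, 3 - read);
            if (n < 0) {
                break;
            }
            read += n;
        }
        boolean hasBom = read == 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;
        if (!hasBom && read > 0) {
            // Not a BOM: put the bytes back so the caller sees the stream unchanged.
            pushback.unread(head, 0, read);
        }
        return pushback;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 's', 'o', 'n', 'y'};
        InputStream in = skipBom(new ByteArrayInputStream(data));
        System.out.println((char) in.read()); // prints s
    }
}
```

The returned stream can then be wrapped in an `InputStreamReader` with an explicit UTF-8 charset, as usual.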

Upvotes: 1

Harry Lime

Reputation: 29576

What is the encoding of the file?

The unseen character at the start of the file could be the byte order mark (BOM).

Saving with ANSI or UTF-8 without BOM can help highlight this for you.
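If you would rather check from code, the first few bytes of the file can be inspected directly. This is a rough sketch with a hypothetical class name; it recognises only the UTF-8 and UTF-16 BOMs (UTF-32 is ignored) and returns null when no BOM is found, which says nothing about the actual encoding.

```java
// Hypothetical helper, for illustration only.
public class BomSniffer {

    /** Returns the encoding implied by a BOM at the start of head, or null if none is recognised. */
    public static String detect(byte[] head) {
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) {
            return "UTF-8";
        }
        if (startsWith(head, 0xFE, 0xFF)) {
            return "UTF-16BE";
        }
        if (startsWith(head, 0xFF, 0xFE)) {
            return "UTF-16LE";
        }
        return null;
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            // Mask to compare as unsigned values, since Java bytes are signed.
            if ((data[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] head = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 's', 'o', 'n', 'y'};
        System.out.println(detect(head)); // prints UTF-8
    }
}
```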

Upvotes: 2

user7094

Reputation:

If spaces are not important in the processing it would probably be worth doing a strLine.trim() call each time anyway. This is what I generally do when handling input like this - spaces can easily creep into a file if it has to be edited manually and if they're not important they can and should be ignored.

Edit: is the file encoded as UTF-8? You may need to specify the encoding when you open the file. It could be the byte order mark or something like that, if it's happening on the first line.

Try:

BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(file), "UTF8"));

Upvotes: 0

Adeel Ansari

Reputation: 39907

There must be a space or some non-printable character at the start. So, either fix that or trim the Strings during/before comparison.

[Edited]

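In case String.trim() is of no avail, try String.replaceAll() with a suitable regex, for example str.replaceAll("\\p{Cntrl}", ""). Note that \p{Cntrl} matches only ASCII control characters, so for a BOM (U+FEFF) you need to match that character explicitly, e.g. str.replace("\uFEFF", "").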
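A small check of these suggestions against a BOM-prefixed string, as a sketch only (the class and method names are hypothetical). It relies on U+FEFF belonging to the Unicode "format" category, which Java's regex engine exposes as `\p{Cf}`:

```java
// Hypothetical demo class, for illustration only.
public class BomRegexDemo {

    /** Removes Unicode format characters (general category Cf), which includes U+FEFF. */
    public static String stripFormatChars(String s) {
        return s.replaceAll("\\p{Cf}", "");
    }

    public static void main(String[] args) {
        String withBom = "\uFEFFsony";

        // trim() only removes characters up to U+0020, so the BOM survives.
        System.out.println(withBom.trim().length());                       // prints 5

        // \p{Cntrl} covers ASCII control characters only, so the BOM survives here too.
        System.out.println(withBom.replaceAll("\\p{Cntrl}", "").length()); // prints 5

        // \p{Cf} matches U+FEFF, leaving the four letters.
        System.out.println(stripFormatChars(withBom).length());            // prints 4
    }
}
```

Be aware that `\p{Cf}` also removes other format characters, such as soft hyphens and zero-width joiners, so replacing only `"\uFEFF"` is the narrower choice when those matter.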

Upvotes: 0

Nico Huysamen

Reputation: 10427

Try trimming whitespace at the beginning and end of lines read. Just replace your while with:

while ((strLine = bufferedReader.readLine()) != null) {
    strLine = strLine.trim();
    logger.info(Arrays.toString(strLine.toCharArray()));
    if (strLine.contentEquals(prevLine)) {
        logger.info("Skipping the duplicate lines " + strLine);
        continue;
    }
    prevLine = strLine;
}

Upvotes: 1
