Reputation: 15927
I am reading a file containing keywords line by line and ran into a strange problem. If consecutive lines have the same content, they should be handled only once. For example, with
sony
sony
only the first one should be processed. But the problem is that Java doesn't treat them as equal:
INFO: [, s, o, n, y]
INFO: [s, o, n, y]
My code looks like the following. Where is the problem?
FileReader fileReader = new FileReader("some_file.txt");
BufferedReader bufferedReader = new BufferedReader(fileReader);
String prevLine = "";
String strLine;
while ((strLine = bufferedReader.readLine()) != null) {
logger.info(Arrays.toString(strLine.toCharArray()));
if(strLine.contentEquals(prevLine)){
logger.info("Skipping the duplicate lines " + strLine);
continue;
}
prevLine = strLine;
}
Update:
It seems like there's a leading space in the first line, but there actually isn't, and the trim approach doesn't work for me. They're not the same:
INFO: [, s, o, n, y]
INFO: [ , s, o, n, y]
I don't know what the first char added by Java is.
Solved: the problem was solved with BalusC's solution. Thanks for pointing out that it's a BOM problem, which helped me find the solution quickly.
Upvotes: 6
Views: 9613
Reputation: 196
Open the file in a text editor, navigate to File > Save As... and choose UTF-8 encoding, instead of UTF-8 with BOM.
Upvotes: 0
Reputation: 9588
The Byte Order Mark (BOM) is a Unicode character. You will get characters like ï»¿ at the start of a text stream, because BOM use is optional and, if used, it appears at the start of the text stream.
File file = new File( csvFilename );
FileInputStream inputStream = new FileInputStream(file);
// [{"Key2":"21","Key1":"11","Key3":"31"} ]
InputStreamReader inputStreamReader = new InputStreamReader( inputStream, "UTF-8" );
We can resolve this by explicitly specifying the charset as UTF-8 for the InputStreamReader. In UTF-8, the byte sequence EF BB BF then decodes to a single character, U+FEFF.
Using CharMatcher from Google Guava's jar, you can remove any non-printable characters and then retain only the ASCII characters (dropping any accented characters) like this:
String printable = CharMatcher.INVISIBLE.removeFrom( input );
String clean = CharMatcher.ASCII.retainFrom( printable );
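If only a leading BOM needs to go and the rest of the text should stay untouched, a plain-JDK alternative is possible. The following is a minimal sketch, assuming a hypothetical helper named stripBom (it is not part of the JDK, Guava, or the code in this answer):

```java
public class BomStrip {
    // Hypothetical helper: drops a single leading U+FEFF, if present.
    static String stripBom(String line) {
        if (line != null && !line.isEmpty() && line.charAt(0) == '\uFEFF') {
            return line.substring(1);
        }
        return line;
    }

    public static void main(String[] args) {
        // Simulates the first line of a file saved as UTF-8 with BOM.
        String first = "\uFEFFsony";
        String second = "sony";
        System.out.println(first.equals(second));            // false
        System.out.println(stripBom(first).equals(second));  // true
    }
}
```

Unlike the ASCII filter above, this leaves accented and other non-ASCII characters in place.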
Full example that reads data from a CSV file into JSON objects:
public class CSV_FileOperations {
static List<HashMap<String, String>> listObjects = new ArrayList<HashMap<String,String>>();
protected static List<JSONObject> jsonArray = new ArrayList<JSONObject >();
public static void main(String[] args) {
String csvFilename = "D:/Yashwanth/json2Bson.csv";
csvToJSONString(csvFilename);
String jsonData = jsonArray.toString();
System.out.println("File JSON Data : \n"+ jsonData);
}
@SuppressWarnings("deprecation")
public static String csvToJSONString( String csvFilename ) {
try {
File file = new File( csvFilename );
FileInputStream inputStream = new FileInputStream(file);
String fileExtensionName = csvFilename.substring(csvFilename.lastIndexOf(".")); // lastIndexOf, so dots in directory names are ignored
System.out.println("File Extension : "+ fileExtensionName);
// [{"Key2":"21","Key1":"11","Key3":"31"} ]
InputStreamReader inputStreamReader = new InputStreamReader( inputStream, "UTF-8" );
BufferedReader buffer = new BufferedReader( inputStreamReader );
Stream<String> readLines = buffer.lines();
boolean headerStream = true;
List<String> headers = new ArrayList<String>();
for (String line : (Iterable<String>) () -> readLines.iterator()) {
String[] columns = line.split(",");
if (headerStream) {
System.out.println(" ===== Headers =====");
for (String keys : columns) {
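// BOM (ï»¿ when mis-decoded) in UTF-8 « https://stackoverflow.com/a/11021401/5081877
String printable = CharMatcher.INVISIBLE.removeFrom( keys );
String clean = CharMatcher.ASCII.retainFrom(printable);
// replaceAll, not replace: the pattern is a regex, and replace() would treat it as literal text
String key = clean.replaceAll("\\P{Print}", "");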
headers.add( key );
}
headerStream = false;
System.out.println(" ===== ----- Data ----- =====");
} else {
addCSVData(headers, columns );
}
}
inputStreamReader.close();
buffer.close();
} catch (FileNotFoundException e) {
e.printStackTrace();
} catch (IOException e) {
e.printStackTrace();
}
return null;
}
@SuppressWarnings("unchecked")
public static void addCSVData( List<String> headers, String[] row ) {
if( headers.size() == row.length ) {
HashMap<String,String> mapObj = new HashMap<String,String>();
JSONObject jsonObj = new JSONObject();
for (int i = 0; i < row.length; i++) {
mapObj.put(headers.get(i), row[i]);
jsonObj.put(headers.get(i), row[i]);
}
jsonArray.add(jsonObj);
listObjects.add(mapObj);
} else {
System.out.println("Avoiding the Row Data...");
}
}
}
Data in json2Bson.csv:
Key1 Key2 Key3
11 21 31
12 22 32
13 23 33
Upvotes: 2
Reputation: 593
I had a similar case in my previous project. The culprit was the Byte order mark, which I had to get rid of. Eventually I implemented a hack based on this example. Check it out, might be that you have the same problem.
Upvotes: 1
Reputation: 29576
What is the encoding of the file?
The unseen char at the start of the file could be the Byte Order Mark.
Saving with ANSI or UTF-8 without BOM can help highlight this for you.
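To confirm the suspicion without an editor, the raw bytes can be inspected directly. This is a rough sketch, assuming a hypothetical helper named hasUtf8Bom; it only checks for the UTF-8 BOM (bytes EF BB BF), not the UTF-16 or UTF-32 variants:

```java
import java.nio.charset.StandardCharsets;

public class BomCheck {
    // Hypothetical helper: true if the data starts with the UTF-8 BOM bytes EF BB BF.
    static boolean hasUtf8Bom(byte[] data) {
        return data.length >= 3
                && (data[0] & 0xFF) == 0xEF
                && (data[1] & 0xFF) == 0xBB
                && (data[2] & 0xFF) == 0xBF;
    }

    public static void main(String[] args) {
        // In-memory stand-ins for a file saved with and without a BOM.
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 's', 'o', 'n', 'y'};
        byte[] plain = "sony".getBytes(StandardCharsets.UTF_8);
        System.out.println(hasUtf8Bom(withBom)); // true
        System.out.println(hasUtf8Bom(plain));   // false
    }
}
```

For a real file, the bytes could come from Files.readAllBytes(Paths.get("some_file.txt")), using the file name from the question.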
Upvotes: 2
Reputation:
If spaces are not important in the processing it would probably be worth doing a strLine.trim() call each time anyway. This is what I generally do when handling input like this: spaces can easily creep into a file if it has to be edited manually, and if they're not important they can and should be ignored.
Edit: is the file encoded as UTF-8? You may need to specify the encoding when you open the file. It could be the byte order mark or something like that, if it's happening on the first line.
Try:
BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(file), "UTF-8"));
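One caveat: Java's UTF-8 decoder does not drop the BOM. It passes it through as the character U+FEFF, so the first line still starts with it. A minimal sketch of skipping it at the reader level follows, assuming a hypothetical helper named open and using an in-memory stream in place of the real file:

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class BomSkippingReader {
    // Hypothetical helper: wraps the stream in a UTF-8 reader and consumes a leading U+FEFF.
    static BufferedReader open(InputStream in) throws IOException {
        BufferedReader reader = new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
        reader.mark(1);
        if (reader.read() != 0xFEFF) {
            reader.reset(); // first char was real data (or end of stream), so rewind
        }
        return reader;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for a BOM-prefixed file containing the two "sony" lines from the question.
        byte[] data = "\uFEFFsony\nsony\n".getBytes(StandardCharsets.UTF_8);
        try (BufferedReader reader = open(new ByteArrayInputStream(data))) {
            String prevLine = "";
            String strLine;
            while ((strLine = reader.readLine()) != null) {
                if (strLine.contentEquals(prevLine)) {
                    System.out.println("Skipping the duplicate line " + strLine);
                    continue;
                }
                System.out.println("Processing " + strLine);
                prevLine = strLine;
            }
        }
    }
}
```

With the BOM consumed up front, the second "sony" line is recognised as a duplicate of the first.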
Upvotes: 0
Reputation: 39907
There must be a space or some non-printable character at the start. So, either fix that or trim the Strings during/before comparison.
[Edited]
In case String.trim() is of no avail, try String.replaceAll() with a proper regex. Try this: str.replaceAll("\\p{Cntrl}", "").
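It is worth checking which of these approaches actually removes a BOM. String.trim() only strips characters up to U+0020, and \p{Cntrl} in Java's default regex mode covers only the ASCII control characters, whereas U+FEFF belongs to the Unicode "format" category (Cf). A small sketch, with removeFormatChars as a hypothetical helper name:

```java
public class InvisibleCharDemo {
    // Hypothetical helper: removes all Unicode format characters (category Cf), which includes U+FEFF.
    static String removeFormatChars(String s) {
        return s.replaceAll("\\p{Cf}", "");
    }

    public static void main(String[] args) {
        String line = "\uFEFFsony"; // BOM followed by four letters: length 5

        System.out.println(line.trim().length());                        // 5, BOM survives trim()
        System.out.println(line.replaceAll("\\p{Cntrl}", "").length());  // 5, BOM is not a control char
        System.out.println(removeFormatChars(line).length());            // 4, BOM removed
    }
}
```

Note that \p{Cf} also removes other format characters such as zero-width joiners, so it is broader than stripping just a leading BOM.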
Upvotes: 0
Reputation: 10427
Try trimming whitespace at the beginning and end of lines read. Just replace your while with:
while ((strLine = bufferedReader.readLine()) != null) {
strLine = strLine.trim();
logger.info(Arrays.toString(strLine.toCharArray()));
if(strLine.contentEquals(prevLine)){
logger.info("Skipping the duplicate lines " + strLine);
continue;
}
prevLine = strLine;
}
Upvotes: 1