Reputation: 117
I have a class with a main method:
import java.io.IOException;
import java.util.List;

public class Main {
    // args[0] is the path to the file with the first and last words
    // args[1] is the path to the file with the dictionary
    public static void main(String[] args) {
        try {
            List<String> firstLastWords = FileParser.getWords(args[0]);
            System.out.println(firstLastWords);
            System.out.println(firstLastWords.get(0).length());
        } catch (IOException ex) {
            ex.printStackTrace();
        }
    }
}
and I have a FileParser class:
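import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class FileParser {
    public FileParser() {
    }

    final static Charset ENCODING = StandardCharsets.UTF_8;

    public static List<String> getWords(String filePath) throws IOException {
        List<String> list = new ArrayList<String>();
        Path path = Paths.get(filePath);
        // try-with-resources closes the reader, so no explicit close() is needed
        try (BufferedReader reader = Files.newBufferedReader(path, ENCODING)) {
            String line = null;
            while ((line = reader.readLine()) != null) {
                // strip all whitespace from the line
                String line1 = line.replaceAll("\\s+","");
                // keep only lines that still contain something
                if (!line1.isEmpty()) {
                    list.add(line1);
                }
            }
        }
        return list;
    }
}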
args[0] is the path to a text file with just two words. If the file contains:
тор
кит
the program prints:
[тор, кит]
4
If the file contains:
т
тор
кит
the program prints:
[т, тор, кит]
2
Even if the file starts with an empty line:
(empty line)
тор
кит
the program prints:
[, тор, кит]
1
where the number is the length of the first string in the list.
So the question is: why does it count one extra character?
Upvotes: 4
Views: 294
Reputation: 117
Thanks, all.
As @Bill said, this character is a BOM (byte order mark, http://en.wikipedia.org/wiki/Byte_order_mark), which sits at the beginning of the text file.
I found it with this line:
System.out.println(((int)firstLastWords.get(0).charAt(0)));
It printed 65279, which is 0xFEFF.
Then I just changed this line:
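A more general way to spot invisible characters like this is to dump every code point of the suspicious string rather than only the first one. This is a minimal sketch, not part of the original code: the class name `CodePointDump` and the helper `describe` are made up for illustration, and the BOM-prefixed string only simulates what `readLine()` returns for the first line of such a file.

```java
import java.util.stream.Collectors;

public class CodePointDump {
    // Hypothetical helper: renders each code point as U+XXXX so that
    // invisible characters become visible in the output.
    static String describe(String s) {
        return s.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        // Simulated first line: BOM followed by the Cyrillic letters т, о, р
        // (written as escapes so the source file encoding does not matter).
        String first = "\uFEFF\u0442\u043E\u0440";
        System.out.println(first.length());  // prints 4
        System.out.println(describe(first)); // prints U+FEFF U+0442 U+043E U+0440
    }
}
```

The leading U+FEFF in the dump accounts for the extra unit reported by `length()`.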
String line1 = line.replaceAll("\\s+","");
to this:
String line1 = line.replaceAll("\uFEFF","");
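Note that this replacement takes the place of the whitespace removal, so spaces around the words are no longer stripped. If both are wanted, one option is to drop a leading BOM first and then remove whitespace as before. The following is only a sketch under that assumption; `BomStrip` and `clean` are hypothetical names, not part of the original `FileParser`.

```java
public class BomStrip {
    private static final char BOM = '\uFEFF';

    // Hypothetical helper: removes a single leading BOM, then all whitespace.
    static String clean(String line) {
        if (!line.isEmpty() && line.charAt(0) == BOM) {
            line = line.substring(1);
        }
        return line.replaceAll("\\s+", "");
    }

    public static void main(String[] args) {
        // BOM, a space, the letters т о р, and a trailing space
        String raw = "\uFEFF \u0442\u043E\u0440 ";
        System.out.println(clean(raw).length()); // prints 3
    }
}
```

A BOM normally appears only at the very start of the file, so in `getWords` the check would only ever fire on the first line; applying it to every line is harmless.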
Upvotes: 2
Reputation: 9405
Cyrillic characters are difficult to capture with regex. For example, \p{Graph} does not match them, although they are clearly visible characters; in Java this POSIX class covers only US-ASCII by default. Anyway, that is beside the OP's question.
The actual problem is likely caused by other non-visible characters, probably control characters. Try the following regex to remove more of them: replaceAll("(\\s|\\p{Cntrl})+","")
You can adapt the regex to cover other cases.
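Before relying on a particular character class, it may help to check what it actually matches. The sketch below (class name `RegexClassCheck` and helper `fullMatch` are hypothetical) tests a Cyrillic letter and the BOM character U+FEFF against a few classes. As far as the `java.util.regex.Pattern` documentation goes, `\s` and `\p{Cntrl}` are ASCII-only by default, and U+FEFF belongs to the Unicode "format" category (`Cf`), so it is not matched by the regex suggested above.

```java
import java.util.regex.Pattern;

public class RegexClassCheck {
    // Hypothetical helper: true if the whole input matches the regex under the given flags.
    static boolean fullMatch(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        String cyr = "\u0442"; // Cyrillic letter т
        String bom = "\uFEFF"; // byte order mark

        // POSIX classes are US-ASCII only unless UNICODE_CHARACTER_CLASS is set.
        System.out.println(fullMatch("\\p{Graph}", 0, cyr));                               // prints false
        System.out.println(fullMatch("\\p{Graph}", Pattern.UNICODE_CHARACTER_CLASS, cyr)); // prints true
        // Unicode general categories work without any flag.
        System.out.println(fullMatch("\\p{L}", 0, cyr));                                   // prints true

        // The BOM is neither whitespace nor an ASCII control character.
        System.out.println(fullMatch("(\\s|\\p{Cntrl})+", 0, bom));                        // prints false
        System.out.println(fullMatch("\\p{Cf}", 0, bom));                                  // prints true
    }
}
```

If the goal is to remove invisible format characters in general, adding `\p{Cf}` to the alternation is one possible extension, though it also strips characters such as zero-width joiners that some scripts rely on.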
Upvotes: 1