Reputation: 45
Java: When I try to open and read a file with BufferedReader, I get an error because I used the wrong encoding. The system throws an exception saying that my decoder cannot read the file. In this case, how can I find out which encoding the file uses? Of course, if the file was written as "utf-8", it is impossible to read it with the "euc-kr" encoding. I would like to get the Charset information before opening the file so that I can select the right encoding for it. Can anybody help me?
Here is my code:
package lecture06;

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.LinkOption;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Scanner;

public class FindExample01 {
    /**
     * Initialized in: getInput()
     * Used at: findPattern()
     */
    private static String pattern;

    /**
     * Initialized in: initApplication()
     */
    private static BufferedWriter wBuffer;

    public static void main(String[] args) {
        initApplication();
        Path dir = Paths.get(getInput());
        System.out.println("root = " + dir.toString());
        System.out.println("pattern = " + pattern);
        searchDirectory(dir.toString());
        try {
            // close() flushes the buffer first, so a separate flush() is not needed
            wBuffer.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    private static void initApplication()
    {
        try {
            wBuffer = Files.newBufferedWriter(Paths.get("Index.txt"), StandardCharsets.UTF_8);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    private static String getInput()
    {
        Scanner sc = new Scanner(System.in);
        String dir = null;
        // Ask until an existing path is entered
        for (;;)
        {
            System.out.println("Root Directory: ");
            dir = sc.next();
            if (Files.exists(Paths.get(dir), LinkOption.NOFOLLOW_LINKS)) break;
        }
        // Ask until the pattern is longer than two characters
        for (;;)
        {
            System.out.println("Find what ?");
            pattern = sc.next();
            if (pattern.length() > 2)
            {
                sc.close();
                return dir;
            }
        }
    }

    private static void searchDirectory(String root)
    {
        File fiRoot = new File(root);
        File[] files = fiRoot.listFiles();
        // listFiles() returns null if root is not a directory or cannot be read
        if (files == null) return;
        for (File file : files)
        {
            if (file.isDirectory()) searchDirectory(file.getAbsolutePath());
            else findPattern(file.toPath());
        }
    }

    private static void findPattern(Path path)
    {
        // try-with-resources closes the reader even when decoding fails
        try (BufferedReader rBuffer = Files.newBufferedReader(path, StandardCharsets.UTF_8)) {
            int count = 1; // current line number
            String line;
            while ((line = rBuffer.readLine()) != null)
            {
                // Continue each search after the previous match;
                // always searching from index 0 would never terminate
                int idx = line.indexOf(pattern);
                while (idx != -1)
                {
                    writeIndex(path.toString(), count, idx);
                    idx = line.indexOf(pattern, idx + 1);
                }
                count++;
            }
        } catch (IOException e) {
            // A file that is not valid UTF-8 ends up here (MalformedInputException)
            e.printStackTrace();
        }
    }

    private static void writeIndex(String path, int count, int idx)
    {
        try {
            wBuffer.write(path + " : " + count + " : " + idx + " : " + pattern);
            wBuffer.newLine();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
Upvotes: 0
Views: 71
Reputation: 298
Try juniversalchardet, an encoding detector library. It can detect a list of popular encodings. You usually don't need to read the whole file for this: the detector is fed the first bytes and reading stops as soon as it reports that it is done.
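// Needs: java.io.InputStream, java.nio.file.Files, org.mozilla.universalchardet.UniversalDetector
// "path" is the Path of the file to inspect
byte[] buf = new byte[4096];
UniversalDetector detector = new UniversalDetector(null);
try (InputStream fileInputStream = Files.newInputStream(path)) {
    int nread;
    // Feed chunks until the end of the file or until the detector has decided
    while ((nread = fileInputStream.read(buf)) > 0 && !detector.isDone()) {
        detector.handleData(buf, 0, nread);
    }
}
detector.dataEnd();
// Name of the detected encoding, or null if none was recognized
String encoding = detector.getDetectedCharset();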
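The detector reports the encoding as a name (a String), and that name can be null when nothing was recognized, so it still has to be turned into a Charset before the reader is opened. A minimal sketch of that step using only the standard library; the class name DetectedCharset, the method resolve, and the idea of a fallback charset are illustrative assumptions, not part of juniversalchardet:

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of juniversalchardet.
final class DetectedCharset {
    private DetectedCharset() {}

    /** Turns a detector-reported name (possibly null) into a Charset, or returns the fallback. */
    static Charset resolve(String detectedName, Charset fallback) {
        if (detectedName == null) {
            return fallback; // nothing was detected
        }
        try {
            // isSupported() is false for names this JVM does not know
            return Charset.isSupported(detectedName) ? Charset.forName(detectedName) : fallback;
        } catch (IllegalCharsetNameException e) {
            return fallback; // the name contains characters that are not legal in a charset name
        }
    }

    public static void main(String[] args) {
        System.out.println(resolve("UTF-8", StandardCharsets.ISO_8859_1));
        System.out.println(resolve(null, StandardCharsets.UTF_8));
        System.out.println(resolve("NO-SUCH-CHARSET", StandardCharsets.UTF_8));
    }
}
```

In findPattern, the result could then replace the hard-coded charset, for example Files.newBufferedReader(path, DetectedCharset.resolve(encoding, StandardCharsets.UTF_8)). Detection is a guess, so the read can still fail and the IOException should still be handled.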
Upvotes: 2
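If adding a library is not an option and the possible encodings are known in advance (for example UTF-8 and EUC-KR), a different approach from the detector is to decode the bytes strictly with each candidate and keep the first one that does not fail. A sketch under those assumptions; CharsetProbe and firstDecodable are made-up names, and a clean decode only shows that the bytes are valid in that charset, not that it is the one the author actually used:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

// Hypothetical helper: trial decoding with the standard library only.
final class CharsetProbe {
    private CharsetProbe() {}

    /** Returns the first candidate that decodes all of data without an error, if any. */
    static Optional<Charset> firstDecodable(byte[] data, List<Charset> candidates) {
        for (Charset candidate : candidates) {
            try {
                candidate.newDecoder()
                        // REPORT makes the decoder throw instead of silently substituting characters
                        .onMalformedInput(CodingErrorAction.REPORT)
                        .onUnmappableCharacter(CodingErrorAction.REPORT)
                        .decode(ByteBuffer.wrap(data));
                return Optional.of(candidate);
            } catch (CharacterCodingException e) {
                // The bytes are not valid in this charset; try the next candidate
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Two Hangul syllables encoded as UTF-8; they are not valid US-ASCII
        byte[] utf8 = "\uD55C\uAE00".getBytes(StandardCharsets.UTF_8);
        List<Charset> candidates = Arrays.asList(StandardCharsets.US_ASCII, StandardCharsets.UTF_8);
        System.out.println(firstDecodable(utf8, candidates));
    }
}
```

The order of the candidates matters: strict encodings such as UTF-8 should come first, because single-byte charsets like ISO-8859-1 accept any byte sequence. Pass the complete file content (for example from Files.readAllBytes), since a buffer cut off in the middle of a multi-byte character is reported as malformed. For the case in the question the list could be UTF-8 followed by Charset.forName("EUC-KR"), assuming the JDK in use provides EUC-KR.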