Rekharaj

Reputation: 11

Search for a Unicode string in a file using Java

How do I search for a Unicode string in a file using Java? Below is the code that I have tried. It works for strings other than Unicode ones.

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;
    import java.io.*;
    import java.util.*;

    class file1 {
        public static void main(String arg[]) throws Exception {
            BufferedReader bfr1 = new BufferedReader(new InputStreamReader(System.in));
            System.out.println("Enter File name:");
            String str = bfr1.readLine();
            BufferedReader br = new BufferedReader(new InputStreamReader(System.in));
            String s;
            int count = 0;
            int flag = 0;

            System.out.println("Enter the string to be found");
            s = br.readLine();
            BufferedReader bfr = new BufferedReader(new FileReader(str));
            String bfr2 = bfr.readLine();
            Pattern p = Pattern.compile(s);
            Matcher matcher = p.matcher(bfr2);
            while (matcher.find()) {
                count++;
            }
            System.out.println(count);
        }
    }

Upvotes: 1

Views: 1078

Answers (1)

Jon Skeet

Reputation: 1500495

Well, there are three potential sources of problems I can see:

  • The regular expression may be incorrect. Do you really need to use a regular expression? Are you trying to match a pattern, or just a simple string?
  • You may be failing to get non-ASCII input from the command line. You should dump out the input string in terms of its Unicode characters (see code later).
  • You may well be reading the file in the wrong encoding. Currently you're using FileReader, which always uses the platform default encoding. What's the encoding of the file you're trying to read? I would recommend using FileInputStream wrapped in an InputStreamReader using an explicit encoding (e.g. UTF-8) which matches the file.
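
As a rough sketch of the first and third points together, something like the following reads the file through a FileInputStream wrapped in an InputStreamReader with an explicit encoding, and counts plain-string matches with indexOf instead of a regular expression. The class name UnicodeSearch and the method countOccurrences are made up for illustration, UTF-8 is only an assumption about the file (and about the console), and unlike the code in the question it scans every line rather than just the first.

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class UnicodeSearch {

    // Hypothetical helper: counts non-overlapping occurrences of a literal
    // string in a file that is ASSUMED to be encoded as UTF-8.
    static int countOccurrences(String fileName, String needle) throws IOException {
        if (needle == null || needle.isEmpty()) {
            return 0; // nothing sensible to search for
        }
        int count = 0;
        // Explicit encoding instead of FileReader's platform default.
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new FileInputStream(fileName), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                int from = 0;
                int index;
                // indexOf does a plain substring search, so no regex escaping is needed.
                while ((index = line.indexOf(needle, from)) >= 0) {
                    count++;
                    from = index + needle.length();
                }
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        // Decoding the console as UTF-8 is an assumption about the terminal setup.
        BufferedReader console = new BufferedReader(
                new InputStreamReader(System.in, StandardCharsets.UTF_8));
        System.out.println("Enter file name:");
        String fileName = console.readLine();
        System.out.println("Enter the string to be found:");
        String needle = console.readLine();
        if (fileName == null || needle == null) {
            System.out.println("No input given");
            return;
        }
        System.out.println(countOccurrences(fileName, needle));
    }
}
```

If you do need a Pattern but want the input treated literally, Pattern.quote(s) or Pattern.compile(s, Pattern.LITERAL) avoids surprises from regex metacharacters in the search string.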

To debug the real values in strings, I would usually use something like this:

    private static void dumpString(String text) {
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            System.out.printf("%d: %4h (%c)", i, c, c);
            System.out.println();
        }
    }

That way you can see the exact UTF-16 code unit held in each char of the string.
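
Note that dumpString prints one line per char, so a character outside the Basic Multilingual Plane shows up as two surrogate values. If that gets in the way, a variant along these lines walks the string by code point instead. This is only a sketch; CodePointDump and describeCodePoints are names invented for the example.

```java
public class CodePointDump {

    // Hypothetical helper: returns one line per Unicode code point, giving
    // the char index where it starts and its value in U+XXXX notation.
    static String describeCodePoints(String text) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int codePoint = text.codePointAt(i);
            sb.append(String.format("%d: U+%04X", i, codePoint)).append('\n');
            // Advance by 2 chars for a surrogate pair, otherwise by 1.
            i += Character.charCount(codePoint);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // 'a', e-acute, and an emoji that needs a surrogate pair in UTF-16.
        System.out.print(describeCodePoints("a\u00e9\uD83D\uDE00"));
    }
}
```

Comparing this output for the search string and for a line read from the file should show quickly whether the two really contain the same characters.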

Upvotes: 3
