Reputation: 595
Say we have a main string that contains some text in UTF-8, and another string that is a single word, also in UTF-8. I want to check whether the word occurs in the main string. Please help me do this in Java. Thank you.
import java.awt.Component;
import javax.swing.JFileChooser;
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.UnsupportedEncodingException;
import java.io.Writer;
public class Example {
    private static Component frame;

    public static void main(String args[]) throws FileNotFoundException, IOException {
        JFileChooser fc = new JFileChooser();
        int returnVal = fc.showOpenDialog(frame); // Where frame is the parent component
        File file = null;
        if (returnVal == JFileChooser.APPROVE_OPTION) {
            file = fc.getSelectedFile();
            // Now you have your file to do whatever you want to do
            String str = file.getName();
            str = "c:\\" + str;
            BufferedReader in = new BufferedReader(new InputStreamReader(new FileInputStream(str), "UTF8"));
            String line;
            String wordfname = "c:\\word.txt";
            BufferedReader innew = new BufferedReader(new InputStreamReader(new FileInputStream(wordfname), "UTF8"));
            String word;
            word = innew.readLine();
            System.out.println(word);
            File fileDir = new File("c:\\test.txt");
            Writer out = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(fileDir), "UTF8"));
            while ((line = in.readLine()) != null) {
                System.out.println(line);
                out.append(line).append("\r\n");
                boolean r = line.contains(word);
                System.out.println(r);
            }
            out.flush();
            out.close();
            System.out.println(str);
        } else {
            // User did not choose a valid file
        }
    }
}
Links to the two files: https://www.dropbox.com/s/4ej0hii6gnlwtga/kannada.txt and https://www.dropbox.com/s/emncfr7bsi8mvwn/word.txt
Upvotes: 0
Views: 1393
Reputation: 595
Thank you all for your help. Now I'm able to find the substring. It worked when I moved the word to the second line of the word.txt file and read it with a second readLine() call.
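The thread does not say why moving the word to the second line helped. One possible explanation, which is only an assumption here, is that word.txt began with a UTF-8 byte order mark (U+FEFF). Java's UTF-8 decoding keeps that character, so it would end up as the first character of the first line and make contains() fail. A minimal sketch of removing it, where BomStripper and stripBom are hypothetical names and the sample strings are made up:

```java
class BomStripper {
    // Hypothetical helper: drops a leading U+FEFF (byte order mark) if present.
    static String stripBom(String s) {
        return (s != null && s.startsWith("\uFEFF")) ? s.substring(1) : s;
    }

    public static void main(String[] args) {
        String line = "some text with word inside";
        // Simulates the first line of a file that was saved with a BOM.
        String wordAsRead = "\uFEFFword";

        System.out.println(line.contains(wordAsRead));           // false
        System.out.println(line.contains(stripBom(wordAsRead))); // true
    }
}
```

If this is the cause, calling such a helper on the first line read from each file would avoid the need for the extra line in word.txt.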
Upvotes: 0
Reputation: 109547
In fact you did everything right, apart from some UTF-8 details. Java's Reader, Writer and String all handle Unicode.
(Please close the readers too; calling flush before close is not needed.)
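One way to make sure the reader is closed is try-with-resources. The following is only a sketch: Utf8LineSearch, countLinesContaining and demo are hypothetical names, and the sample content is made up rather than taken from the linked files.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class Utf8LineSearch {
    // Counts the lines of a UTF-8 file that contain the given word.
    // The reader is closed automatically when the try block ends.
    static int countLinesContaining(Path file, String word) throws IOException {
        int count = 0;
        try (BufferedReader in = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            String line;
            while ((line = in.readLine()) != null) {
                if (line.contains(word)) {
                    count++;
                }
            }
        }
        return count;
    }

    // Writes a small temporary UTF-8 file and searches it.
    static int demo() {
        // A Kannada word written with Unicode escapes so the source stays ASCII.
        String word = "\u0C85\u0CA6\u0CB0\u0CB2\u0CCD\u0CB2\u0CBF";
        String content = "abc " + word + " def\nno match here\n";
        try {
            Path tmp = Files.createTempFile("sample", ".txt");
            try {
                Files.write(tmp, content.getBytes(StandardCharsets.UTF_8));
                return countLinesContaining(tmp, word);
            } finally {
                Files.delete(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // 1
    }
}
```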
There is one thing: zero-width combining diacritical marks. Small c-circumflex, ĉ, is one character in the Unicode table, code point U+0109, in Java "\u0109", but it can also be written as two Unicode code points: c plus a zero-width combining ^, in Java "c\u0302".
Java provides text normalization (java.text.Normalizer), which converts a string into one specific form.
import java.text.Normalizer;
String cCircumflex = "\u0109"; // ĉ as one precomposed code point
String cWithCircumflex = "c\u0302"; // c followed by a combining circumflex
String cx = Normalizer.normalize(cCircumflex, Normalizer.Form.NFKC);
String cx2 = Normalizer.normalize(cWithCircumflex, Normalizer.Form.NFKC);
assert cx.equals(cx2);
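Applied to the substring search from the question, this could look like the sketch below. NormalizedSearch and containsNormalized are hypothetical names, and the sample strings are made up for illustration.

```java
import java.text.Normalizer;

class NormalizedSearch {
    // Normalizes both strings to the same form (NFKC) before searching.
    static boolean containsNormalized(String text, String word) {
        String t = Normalizer.normalize(text, Normalizer.Form.NFKC);
        String w = Normalizer.normalize(word, Normalizer.Form.NFKC);
        return t.contains(w);
    }

    public static void main(String[] args) {
        String text = "la \u0109ambro"; // precomposed c-circumflex
        String word = "c\u0302ambro";   // c plus combining circumflex

        System.out.println(text.contains(word));            // false
        System.out.println(containsNormalized(text, word)); // true
    }
}
```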
Which normalization form to choose is more or less irrelevant, as long as both strings use the same one. Composition (...C) seems most natural (and gives better font rendering), but decomposition (...D) allows natural sorting such as "aäá...cĉ...eé...".
You could even search for words with the diacritical marks removed (cafe versus café):
word = Normalizer.normalize(word, Normalizer.Form.NFKD); // Decompose.
word = word.replaceAll("\\p{M}", ""); // Remove diacriticals.
word = word.replaceAll("\\p{C}", ""); // Optional: invisible control characters.
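As a self-contained sketch of that idea (DiacriticInsensitiveSearch, stripMarks and containsIgnoringMarks are hypothetical names): both the text and the word are decomposed and stripped of marks before comparing. Note that \p{M} matches every combining mark, including Kannada vowel signs and the virama, so this is probably not suitable for the Kannada text in the question.

```java
import java.text.Normalizer;

class DiacriticInsensitiveSearch {
    // Decomposes the string and removes all combining marks.
    static String stripMarks(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFKD);
        return decomposed.replaceAll("\\p{M}", "");
    }

    // Compares text and word with marks removed from both.
    static boolean containsIgnoringMarks(String text, String word) {
        return stripMarks(text).contains(stripMarks(word));
    }

    public static void main(String[] args) {
        System.out.println(stripMarks("caf\u00e9"));                              // cafe
        System.out.println(containsIgnoringMarks("un caf\u00e9 noir", "cafe"));   // true
    }
}
```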
After running the original code
It seems to work for me without any change (Java 8), though I had to put kannada.txt in C:\.
ಅದರಲ್ಲಿ
್ರಪಂಚದಲ್ಲಿ ಅನೇಕ ಮಾಧ್ಯಮಗಳು ಇದೆ. ಆಕಾಶವಾಣಿ, ದೂರದರ್ಶನ, ವಾರ್ತಾ ಪತ್ರಿಕೆ ಮುಂತಾದವು ಅದರಲ್ಲಿ ದೂರದರ್ಶನಪ ಪ್ರಮುಖವಾದ ಕಾರ್ಯವನ್ನು ಹೊಂದಿದ್ದು ಅದನ್ನು ಚಿಕ್ಕವರಿಂದ ಹಿಡಿದು ದೊಡ್ಡವರವರೆಗೂ ನೋಡುತ್ತಾರೆ. ಇದಕ್ಕೆ ಇಂಗ್ಲೀಷ್ನಲ್ಲಿ ಟೆಲಿವಿಷನ್ ಎಂದು ಚಿಕ್ಕದಾಗಿ ಟಿ.ವಿ. ಎಂದು ಕರೆಯುವ ಬದಲು ಟಿ.ಕೆ. ಎಂದು ಕರೆಯಬೇಕಾಗಿತ್ತು. ಏಕೆಂದರೆ ಇದು ಟೆಲಿವಿಷನ್ ಅಷ್ಟೇ ಅಲ್ಲ ಟೈಮ್ ಕಿಲ್ಲರ್ ಕೂಡ. ಇದನ್ನು ಪ್ರಮುಖವಾಗಿ ವಯಸ್ಸಾದವರು ನೋಡುತ್ತಾರೆ. ಆದರೆ ಕೆಲಸಕ್ಕೆ ಬಂದ ಕೆಲಸದವರು ತಾವು ಕೆಲಸ ಮಾಡುವ ಬದಲು ಮನೆಯಲ್ಲಿ ಕುಳಿತು ನೋಡುತ್ತಾರೆ.
true
false
ನನ್ನ ಪ್ರಕಾರ ಹೇಳಬೇಕಾದರೆ ಡಾಕ್ಷರ್ಗಳಿಗೆ ದುಡ್ಡು ಕೊಡುವ ಮಹಾಲಕ್ಷ್ಮಿ ಈ ಟಿ.ವಿ.
false
c:\kannada.txt
Upvotes: 1
Reputation: 2291
String objects actually have a fixed UTF-16 encoding.
A byte[] technically has no encoding, though you can attach an encoding to it. So if you need UTF-8 encoded data, you need a byte[].
So my approach would be
byte[] text = str.getBytes("UTF-8"); // str is the String to encode; getBytes is an instance method
to get a UTF-8 byte[].
But IMHO, finding a substring in a String (which is always UTF-16!) that is supposedly UTF-8 encoded makes no sense :)
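To illustrate the difference between a String and its UTF-8 bytes, here is a small sketch. EncodingDemo, toUtf8 and fromUtf8 are hypothetical names, and the three Kannada letters are just sample data.

```java
import java.nio.charset.StandardCharsets;

class EncodingDemo {
    // Encodes a String as UTF-8 bytes.
    static byte[] toUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Decodes UTF-8 bytes back into a String.
    static String fromUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String word = "\u0C85\u0CA6\u0CB0"; // three Kannada letters
        byte[] utf8 = toUtf8(word);

        System.out.println(word.length());               // 3 (UTF-16 code units)
        System.out.println(utf8.length);                 // 9 (three bytes per letter in UTF-8)
        System.out.println(fromUtf8(utf8).equals(word)); // true
    }
}
```

Once the text has been decoded into String objects, contains() works on the characters and the original file encoding no longer matters.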
Upvotes: 0