Reputation: 31
I am trying to read in the content of a file to any readable form. I am using a FileInputStream to read from the file to a byte array, and then am trying to convert that byte array into a String.
So far, I have tried 3 different ways:
FileInputStream inputStream = new FileInputStream(file);
byte[] clearTextBytes = new byte[(int) file.length()];
inputStream.read(clearTextBytes);
String s = IOUtils.toString(inputStream); //first way
String str = new String(clearTextBytes, "UTF-8"); //second way
String string = Arrays.toString(clearTextBytes); //third way
String[] byteValue = string.substring(1, string.length() - 1).split(",");
byte[] bytes = new byte[byteValue.length];
for (int i = 0, len = bytes.length; i < len; i++) {
    bytes[i] = Byte.parseByte(byteValue[i].trim());
}
String newStr = new String(bytes);
When I print out each of the Strings:
1) prints out nothing, and
2 & 3) print out a lot of weird characters, such as:
PK!�Q���[Content_Types].xml �(���MO�@��&��f��]���pP<*���v
�ݏ�,_��i�I�(zi�N��}fڝ�
��h�5)�&��6Sf����c|�"�d��R�d���Eo�r��
�l�������:0Tɭ�"Э�p'䧘��tn��&� q(=X����!.���,�_�WF�L8W......
I would love any advice on how to properly convert my byte array to a String.
Upvotes: 2
Views: 4095
Reputation: 533820
As others have noted, the data doesn't look like it contains any text, so it is quite possibly binary data rather than text. Note that files which start with PK could be in PKZIP format, and the randomness of your data suggests it could be compressed. See http://www.garykessler.net/library/file_sigs.html
Try renaming the file to have .ZIP at the end and see if you can open it in file explorer.
From the link above, the start of a DOCX file looks as follows.
50 4B 03 04 14 00 06 00 PK...... DOCX, PPTX, XLSX
Microsoft Office Open XML Format (OOXML) Document NOTE: There is no subheader for MS OOXML files as there is with DOC, PPT, and XLS files. To better understand the format of these files, rename any OOXML file to have a .ZIP extension and then unZIP the file; look at the resultant file named [Content_Types].xml to see the content types. In particular, look for the <Override PartName= tag, where you will find word, ppt, or xl, respectively. Trailer: Look for 50 4B 05 06 (PK..) followed by 18 additional bytes at the end of the file.
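For what it's worth, the same signature check can be done in code instead of renaming the file. This is only a minimal sketch, assuming Java 11+ (for readNBytes); the class name ZipSignature and the method looksLikeZip are made up for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of any library.
class ZipSignature {
    // Local file header signature from the table above: 50 4B 03 04 ("PK" + 03 04).
    private static final byte[] MAGIC = {0x50, 0x4B, 0x03, 0x04};

    /** Returns true if the given bytes begin with the ZIP signature. */
    static boolean looksLikeZip(byte[] head) {
        if (head.length < MAGIC.length) {
            return false;
        }
        for (int i = 0; i < MAGIC.length; i++) {
            if (head[i] != MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    /** Reads at most the first four bytes of the file and checks them. */
    static boolean looksLikeZip(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            return looksLikeZip(in.readNBytes(MAGIC.length));
        }
    }
}
```

Calling ZipSignature.looksLikeZip(file.toPath()) on the file from the question would tell you whether it starts with 50 4B 03 04. A match only says the container is ZIP; it does not distinguish a DOCX from an XLSX or a plain archive.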
Assuming you do have text data, the character encoding is most likely neither your platform default nor UTF-8. You need to a) check what the encoding actually is, and b) check that the corruption is not introduced when you output the string rather than when you read the input.
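One way to rule out the output side is to look at the raw bytes as hex, so the console's encoding never gets involved. A small sketch; HexDump and toHex are names I made up, and the limit parameter is just there to avoid dumping a whole file.

```java
// Hypothetical helper for inspecting the first bytes of a buffer.
class HexDump {
    /** Formats at most `limit` leading bytes as space-separated uppercase hex pairs. */
    static String toHex(byte[] bytes, int limit) {
        StringBuilder sb = new StringBuilder();
        int n = Math.min(bytes.length, limit);
        for (int i = 0; i < n; i++) {
            if (i > 0) {
                sb.append(' ');
            }
            // Mask to 0..255 so negative bytes do not print as FFFFFFxx.
            sb.append(String.format("%02X", bytes[i] & 0xFF));
        }
        return sb.toString();
    }
}
```

For example, System.out.println(HexDump.toHex(clearTextBytes, 16)) prints only ASCII characters, so what you see is what was read from the file.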
You can try brute force to find a character set which doesn't produce any unknown characters.
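import java.nio.charset.Charset;
import java.util.LinkedHashSet;
import java.util.Set;

// Returns every installed charset that decodes the bytes without a replacement character.
public static Set<Charset> possibleCharsets(byte[] bytes) {
    Set<Charset> charsets = new LinkedHashSet<>();
    for (Charset charset : Charset.availableCharsets().values()) {
        // new String(...) substitutes U+FFFD for anything it cannot decode
        if (!new String(bytes, charset).contains("\uFFFD"))
            charsets.add(charset);
    }
    return charsets;
}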
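A stricter variant of the same idea is to ask a CharsetDecoder to report errors instead of silently replacing them. The sketch below is my own illustration, not part of the answer above; StrictDecode and decodesCleanly are hypothetical names.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

// Hypothetical helper, not part of any library.
class StrictDecode {
    /** Returns true only if the whole byte array is valid in the given charset. */
    static boolean decodesCleanly(byte[] bytes, Charset charset) {
        try {
            charset.newDecoder()
                   .onMalformedInput(CodingErrorAction.REPORT)
                   .onUnmappableCharacter(CodingErrorAction.REPORT)
                   .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            // Thrown for malformed or unmappable input once REPORT is set.
            return false;
        }
    }
}
```

Keep in mind that single-byte charsets such as ISO-8859-1 accept every possible byte value, so a clean decode is necessary but not sufficient evidence that you picked the right encoding.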
Upvotes: 4
Reputation: 11
I've written a very basic program to read the contents of a file and to print each string on a new line in the console. Here is the content of the file:
Here is the program I wrote:
import java.io.*;
import java.util.*;

class Test {
    public static void main(String args[]) throws FileNotFoundException {
        File file = new File("File1.txt");
        Scanner input = new Scanner(file);
        while (input.hasNext()) {
            System.out.println(input.next());
        }
        input.close();
    } // main()
} // class Test
This is the output to the console:
apples
pears
1
2
3
oranges
carrots
bananas
pineapples
Upvotes: 0
Reputation: 13858
To get the text contents of a Word file, you need the Apache POI libraries:
import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
[...]
XWPFDocument docx = new XWPFDocument(new FileInputStream("file.docx"));
XWPFWordExtractor we = new XWPFWordExtractor(docx);
System.out.println(we.getText());
Upvotes: 0
Reputation: 2030
UTF-8 can hold about 2,097,152 different values in its four-byte form (Unicode itself defines 1,114,112 code points); for those that have no glyph, you see the question mark. Try the classic DOS codepage instead:
new String(clearTextBytes, "IBM437"); // code page 437 (DOS-US); "DOS-US" is not a Java charset name
Upvotes: 0