Watt

Reputation: 3164

How to preserve the correct offsets of a string read from a file

I have a text.txt file which contains the following text.

 Kontagent Announces Partnership with Global Latino Social Network Quepasa

 Released By Kontagent

I read this text file into a string documentText.

documentText.substring(0, 9) gives Kontagent, which is correct.

But documentText.substring(87, 96) gives "y Kontage" on Windows (IntelliJ IDEA) and "Kontagent" in a Unix environment. I am guessing this happens because of the blank line in the file, after which the offsets shift. But I cannot understand why I get two different results. I need to get the same result in both environments.

To read the file as a string, I tried all the functions discussed in "How do I create a Java string from the contents of a file?", but I get the same results with every one of them.

Currently I am using this function to read the file into documentText String:

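// Requires: import java.io.File; import java.io.FileNotFoundException; import java.util.Scanner;
public static String readFileAsString(String fileName) throws FileNotFoundException
{
    File file = new File(fileName);
    StringBuilder fileContents = new StringBuilder((int) file.length());
    // Platform-dependent separator: "\r\n" on Windows, "\n" on Unix.
    String lineSeparator = System.getProperty("line.separator");

    // try-with-resources closes the scanner even if reading fails, and a missing
    // file now propagates as an exception instead of leaving the scanner null.
    try (Scanner scanner = new Scanner(file)) {
        while (scanner.hasNextLine()) {
            // nextLine() drops the file's own line terminator; the platform one is appended.
            fileContents.append(scanner.nextLine()).append(lineSeparator);
        }
        return fileContents.toString();
    }
}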

EDIT: Is there a way to write a general function that works in both Windows and Unix environments, even if the file was copied in text mode? Unfortunately, I cannot guarantee that everyone working on this project will always copy files in binary mode.

Upvotes: 1

Views: 191

Answers (3)

Watt

Reputation: 3164

Based on the input you provided, I wrote something like this:

documentText = CharStreams.toString(new FileReader("text.txt"));
documentText = documentText.replaceAll("\\r", "");

to strip any \r characters the file contains.

Now I am getting the expected result in both the Windows and Unix environments. Problem solved!

It works regardless of the mode in which the file was copied.
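For anyone who wants the same idea without Guava, here is a minimal sketch using only the JDK. The class name LineEndings and both method names are made up for illustration, and the UTF-8 charset is an assumption about the file. Note that it is a variant of the approach above: instead of deleting every \r, it turns \r\n into \n and also treats a lone \r as a line break.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper, not part of any library.
final class LineEndings {

    /** Converts "\r\n" and lone "\r" to "\n" so offsets match on every platform. */
    static String normalize(String text) {
        // Order matters: collapse the two-character sequence first,
        // then map any remaining carriage returns.
        return text.replace("\r\n", "\n").replace('\r', '\n');
    }

    /** Reads the whole file (assumed to be UTF-8) and normalizes its line endings. */
    static String readNormalized(String fileName) throws IOException {
        byte[] bytes = Files.readAllBytes(Paths.get(fileName));
        return normalize(new String(bytes, StandardCharsets.UTF_8));
    }
}
```

With this, LineEndings.readNormalized("text.txt") should give the same offsets whether the file on disk uses \n or \r\n.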

:) I wish I could accept both of your answers, but Stack Overflow doesn't allow it.

Upvotes: 0

JB Nizet

Reputation: 691685

The Unix file probably uses the native Unix EOL char: \n, whereas the Windows file uses the native Windows EOL sequence: \r\n. Since you have two EOLs in your file, there is a difference of 2 chars. Make sure to use a binary file transfer, and all the bytes will be preserved, and everything will run the same way on both OSes.

EDIT: In fact, you are the one who appends an OS-specific EOL (System.getProperty("line.separator")) at the end of each line. Just read the file as a char array using a Reader, and everything will be fine. Or use Guava's method, which does it for you:

String s = CharStreams.toString(new FileReader(fileName)); 
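If Guava is not available, reading through a Reader by hand might look roughly like the sketch below. ReaderUtil, readAll and readFile are illustrative names, and the explicit UTF-8 charset is an assumption (FileReader would use the platform default instead). The characters are copied exactly as they appear in the file, so no OS-specific separator is introduced by the code.

```java
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper, not part of any library.
final class ReaderUtil {

    /** Drains the reader into a String without touching line terminators. */
    static String readAll(Reader reader) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buffer = new char[4096];
        int n;
        while ((n = reader.read(buffer)) != -1) {
            sb.append(buffer, 0, n);
        }
        return sb.toString();
    }

    /** Opens the file with an explicit charset (UTF-8 assumed) and reads it fully. */
    static String readFile(String fileName) throws IOException {
        try (Reader reader = new InputStreamReader(
                Files.newInputStream(Paths.get(fileName)), StandardCharsets.UTF_8)) {
            return readAll(reader);
        }
    }
}
```

Because the content is preserved as-is, a file whose line endings were converted during a text-mode transfer will still produce different offsets; this only removes the extra separator added by the reading code.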

Upvotes: 2

Daniel Li

Reputation: 15379

On Windows, a newline character \n is preceded by \r, a carriage return character. Linux uses \n alone. Transferring the file from one operating system to the other will not strip or add such characters, but occasionally text editors will auto-format them for you.

Because your file does not include \r characters (presumably it was transferred straight from Linux), System.getProperty("line.separator") returns \r\n on Windows and your code adds \r characters that are not in the file. This is why your output is 2 characters behind.
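To see the shift concretely, here is a small sketch that rebuilds the question's text the way readFileAsString does, once with \n and once with \r\n. OffsetDemo and build are names invented for this illustration, and the two lines are assumed to have no leading whitespace in the file.

```java
// Hypothetical demo class, not part of the asker's project.
final class OffsetDemo {
    static final String LINE1 =
            "Kontagent Announces Partnership with Global Latino Social Network Quepasa";
    static final String LINE2 = "Released By Kontagent";

    /** Joins the two lines with a blank line between them, ending each line with eol. */
    static String build(String eol) {
        return LINE1 + eol + eol + LINE2 + eol;
    }

    public static void main(String[] args) {
        // Unix-style separator: two EOL characters precede LINE2.
        System.out.println(build("\n").substring(87, 96));   // prints "Kontagent"
        // Windows-style separator: four EOL characters precede LINE2,
        // so every offset after the blank line moves by 2.
        System.out.println(build("\r\n").substring(87, 96)); // prints "y Kontage"
    }
}
```

The first line is 73 characters long, so "Kontagent" in the last line starts at offset 87 with \n and at offset 89 with \r\n.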

Good luck!

Upvotes: 2
