Shiladittya Chakraborty

Reputation: 4408

Encoding text URL

Below is my text

Test[LF]
[LF]
Test[LF]
[LF]
Test[LF]
Test[LF]

In Notepad++, after enabling "Show Symbol", each line ending is displayed as [LF], as shown above.

When I URL-encode the above text, the result is:

Test%0D%0A%0D%0ATest%0D%0A%0D%0ATest%0D%0ATest

Each [LF] is encoded as %0D%0A.

My question is: why is it encoded as %0D%0A? [LF] should encode as %0A,

whereas [CR] encodes as %0D, but I have not used any [CR] character in the above text.

Upvotes: 0

Views: 306

Answers (1)

Renato

Reputation: 13690

You can use this Java class to find out each byte of your input file:

package example;

import java.io.File;
import java.nio.file.Files;
import java.util.Arrays;

public class FileBytes {
    public static void main( String[] args ) throws Exception {
        if (args.length != 1) {
            throw new IllegalArgumentException( "Please provide one argument" );
        }
        File f = new File( args[0] );
        System.out.println( Arrays.toString( Files.readAllBytes( f.toPath() ) ) );
    }
}

You'll see something like this:

[84, 101, 115, 116, 10, 84, 101, 115, 116, 10]

You can look up each value in an ASCII table, provided your file is encoded as UTF-8 or ASCII and contains only ASCII characters. If not, translating bytes to characters is more complicated, and you will need to look up the particular encoding you are using.

For example, 84 == T and 10 == LF (Line Feed), so you could translate the above to Test(LF)Test(LF).
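As a rough sketch of that translation, the helper below turns a byte array into readable text with CR and LF named explicitly. The class name `ByteNames` and the method `describe` are made up for this illustration, and it assumes the input is plain ASCII.

```java
import java.nio.charset.StandardCharsets;

public class ByteNames {
    // Hypothetical helper: renders ASCII bytes as text, naming CR (13) and LF (10).
    public static String describe(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (b == 13) {
                sb.append("(CR)");
            } else if (b == 10) {
                sb.append("(LF)");
            } else if (b >= 32 && b < 127) {
                // Printable ASCII: show the character itself.
                sb.append((char) b);
            } else {
                // Anything else (other control bytes, non-ASCII): show as hex.
                sb.append(String.format("(0x%02X)", b & 0xFF));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] unix = "Test\nTest\n".getBytes(StandardCharsets.US_ASCII);
        byte[] windows = "Test\r\nTest\r\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(describe(unix));    // Test(LF)Test(LF)
        System.out.println(describe(windows)); // Test(CR)(LF)Test(CR)(LF)
    }
}
```

Passing the result of Files.readAllBytes to `describe` would show directly whether the file on disk contains any CR bytes.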

To escape the whole String in the file so it's safe to use in a URL, use URLEncoder as in this example:

package example;

import java.io.File;
import java.net.URLEncoder;
import java.nio.file.Files;
import java.util.Arrays;

public class FileBytes {
    public static void main( String[] args ) throws Exception {
        if ( args.length != 1 ) {
            throw new IllegalArgumentException( "Please provide one argument" );
        }
        File f = new File( args[ 0 ] );
        byte[] bytes = Files.readAllBytes( f.toPath() );
        String rawText = new String( bytes, "UTF-8" );
        String encodedText = URLEncoder.encode( rawText, "UTF-8" );

        System.out.println( "Raw text: " + rawText );
        System.out.println( "Encoded text: " + encodedText );
        System.out.println( "Raw bytes: " + Arrays.toString( bytes ) );
        System.out.println( "Encoded bytes: " + Arrays.toString( encodedText.getBytes() ) );
    }
}

Which prints:

Raw text: Test
Test

Encoded text: Test%0ATest%0A
Raw bytes: [84, 101, 115, 116, 10, 84, 101, 115, 116, 10]
Encoded bytes: [84, 101, 115, 116, 37, 48, 65, 84, 101, 115, 116, 37, 48, 65]

This shows that the line feed (10) is encoded as %0A (37, 48, 65).

If you still see %0D (carriage return) in the encoded output, the text being encoded contains CR LF line endings, most likely because your editor is adjusting line endings automatically to match the Windows convention. Notepad++ lets you select the line endings explicitly (Edit > EOL Conversion).
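If changing the editor setting is not an option, one possible workaround is to normalize the line endings in code before encoding. The sketch below is only an illustration: the class `LineEndingEncoder` and its methods are hypothetical names, and it assumes you want every CR LF or lone CR turned into a single LF.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class LineEndingEncoder {
    // Hypothetical helper: converts CR LF and lone CR line endings to LF.
    public static String normalizeToLf(String text) {
        return text.replace("\r\n", "\n").replace("\r", "\n");
    }

    // Hypothetical helper: normalizes line endings, then URL-encodes as UTF-8.
    public static String encode(String text) {
        try {
            return URLEncoder.encode(normalizeToLf(text), "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required to be supported by every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) throws Exception {
        String windowsText = "Test\r\nTest";
        // Without normalization, both CR and LF are encoded.
        System.out.println(URLEncoder.encode(windowsText, "UTF-8")); // Test%0D%0ATest
        // With normalization, only LF remains.
        System.out.println(encode(windowsText));                     // Test%0ATest
    }
}
```

Whether normalizing is appropriate depends on the receiver: some consumers of the URL may expect CR LF, so treat this as an option rather than a rule.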

Upvotes: 1
