Reputation: 141

Removing whitespaces in text file

I had to write some simple code that counts the words in a text file. Then someone told me it is incomplete: when there are, for example, two or more whitespace characters in a row, the function counts them as words and the result is incorrect. So I tried to fix it by making a list and removing all " " elements from it, but that doesn't seem to work. Can you suggest what can be done?

Here's the code as it is now:

    int count = 0;
    File file = new File("C:\\Users\\user\\Desktop\\Test.txt");
    FileInputStream fis = new FileInputStream(file);
    byte[] bytesArray = new byte[(int) file.length()];
    fis.read(bytesArray);
    String s = new String(bytesArray);
    String[] data = s.split(" ");
    List<String> list = new ArrayList<>(Arrays.asList(data));
    list.remove(" ");
    data = list.toArray(new String[0]);
    for (int i = 0; i < data.length; i++) {
        count++;
    }
    System.out.println("Number of words in the file are " + count);

Upvotes: 0

Answers (4)

Sampisa

Reputation: 1583

Be a nerd. You can do it in just one line, using classes in the java.nio.file package :)

    int count = new String(Files.readAllBytes(Paths.get("/tmp/test.txt")), "UTF-8")
               .trim().split("\\s+").length;

to count how many words are in the file. Or

    String result = new String(Files.readAllBytes(Paths.get("/tmp/test.txt")), "UTF-8")
               .trim().replaceAll("\\s+", " ");

to get a single string in which every run of whitespace is replaced by one space.
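One caveat with the one-liner: for an empty (or whitespace-only) file, `"".split("\\s+")` returns an array containing one empty string, so the count comes out as 1 instead of 0. A minimal sketch that guards against this is below. The class name `FileWordCount`, the method name `countWords`, and the fallback path are only placeholders for illustration, and UTF-8 is an assumption about the file.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class FileWordCount {

    // Reads the whole file as UTF-8 (an assumption about the encoding)
    // and counts the tokens separated by runs of whitespace.
    static int countWords(Path path) throws IOException {
        String text = new String(Files.readAllBytes(path), StandardCharsets.UTF_8).trim();
        // Without this guard an empty file would be reported as one word.
        return text.isEmpty() ? 0 : text.split("\\s+").length;
    }

    public static void main(String[] args) throws IOException {
        // Placeholder path; pass the real file as the first argument.
        Path path = Paths.get(args.length > 0 ? args[0] : "/tmp/test.txt");
        System.out.println("Number of words in the file: " + countWords(path));
    }
}
```

Using `StandardCharsets.UTF_8` instead of the string `"UTF-8"` also avoids the checked `UnsupportedEncodingException`.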

Upvotes: 1

hunter

Reputation: 4173

The best way to handle this kind of requirement is to first know the character encoding used in the text file. Based on that, read the file byte by byte and do the processing at the same time. For example, if the file is UTF-8, the first byte of a character tells us how many more bytes must be read to complete that character. In the same way, when we find a "." or " " or a line break, we can treat it as a word separator.

This approach is efficient (especially for large files), and the file encoding always matters.

If we call the String constructor with only the byte[], it always uses the platform default encoding, and it also iterates over the array byte by byte.
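The streaming idea can be sketched without decoding UTF-8 by hand: a `Reader` opened with an explicit charset does the byte-to-character decoding, and the loop only has to track whether it is currently inside a word. This is a rough sketch, not the only way to do it. The class name `StreamingWordCount` is made up, UTF-8 is assumed, and only whitespace is treated as a separator here (a "." is not).

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class StreamingWordCount {

    // Counts whitespace-separated words one character at a time,
    // so the whole file never has to be held in memory.
    static long countWords(Reader reader) throws IOException {
        long count = 0;
        boolean inWord = false;
        int c;
        while ((c = reader.read()) != -1) {
            if (Character.isWhitespace(c)) {
                inWord = false;          // a separator ends the current word
            } else if (!inWord) {
                inWord = true;           // first character of a new word
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        if (args.length > 0) {
            // The encoding is stated explicitly instead of relying on the default.
            try (BufferedReader reader =
                     Files.newBufferedReader(Paths.get(args[0]), StandardCharsets.UTF_8)) {
                System.out.println("Number of words in the file: " + countWords(reader));
            }
        } else {
            // No file given: run on a small in-memory sample instead.
            System.out.println(countWords(new StringReader("two  words")));
        }
    }
}
```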

Upvotes: 0

Wazaki

Reputation: 899

Try this line of code:

    String data1 = s.trim().replaceAll(" +", " ");

before the line:

    String[] data = data1.split(" ");

This collapses every run of 2 or more consecutive spaces in the String s into a single space. There is no need to use list.remove(" ").
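As for why the original list.remove(" ") had no effect: splitting on a single space never produces " " elements. Two spaces in a row produce an empty string "" between them, so there is nothing equal to " " to remove (and `List.remove(Object)` would only remove the first match anyway). A small demonstration, where the class name `SplitDemo` and its helper method are just for illustration:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SplitDemo {

    // Same split as in the question, wrapped in a mutable list.
    static List<String> splitOnSingleSpace(String s) {
        return new ArrayList<>(Arrays.asList(s.split(" ")));
    }

    public static void main(String[] args) {
        List<String> parts = splitOnSingleSpace("a  b"); // two spaces between a and b
        System.out.println(parts.size());      // 3: "a", "", "b"
        System.out.println(parts.remove(" ")); // false: no element equals " "
        parts.removeIf(String::isEmpty);       // drops every empty string
        System.out.println(parts);             // [a, b]
    }
}
```

So if you want to keep the list approach, removeIf(String::isEmpty) is the call that does what you intended.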

Upvotes: 1

Vahid Hashemi

Reputation: 5240

You can achieve this with a regex:

    String[] data = s.split("\\s+");

    // Needs: import java.io.File; import java.io.FileInputStream;
    File file = new File("/home/vahid/Documents/test.txt");
    byte[] bytesArray = new byte[(int) file.length()];
    // try-with-resources closes the stream when done
    try (FileInputStream fis = new FileInputStream(file)) {
        fis.read(bytesArray);
    }
    // Uses the platform default encoding, as in the question
    String s = new String(bytesArray).trim();
    // Split on runs of whitespace (spaces, tabs, line breaks)
    String[] data = s.split("\\s+");
    // An empty file splits into one empty string, so treat it as 0 words
    int count = s.isEmpty() ? 0 : data.length;
    System.out.println("Number of words in the file: " + count);

Upvotes: 3
