Reputation: 11
I'm trying to make something like a mini search engine right now. My goal is to index a bunch of files in a hashmap, but first I need to perform a couple of operations: lowercasing everything, removing stop words, and removing all characters except a-z/A-Z. Right now my implementation looks like this:
String article = "";
for (File file : dir.listFiles()) { // for each file (001.txt, 002.txt...)
    Scanner s = null;
    try {
        s = new Scanner(file);
        while (s.hasNext())
            article += s.next().toLowerCase(Locale.ROOT) + " "; // converting all characters to lower case
        article = article.replaceAll(delimiters.get(), " "); // removing punctuation (?, -, !, * etc...)
        String[] splittedWords = article.split(" "); // splitting the text into a string array of words
        for (int i = 0; i < splittedWords.length; i++) {
            Scanner sw = new Scanner(stopwords); // rescanning the stop-word file for every single word
            boolean flag = true;
            while (sw.hasNextLine())
                if (splittedWords[i].equals(sw.nextLine())) { // comparing each word with all the stop words (words like a, the, already, these etc...) taken from another big txt file, so we don't fill the map with unnecessary words and searches stay fast later on
                    flag = false;
                    break;
                }
            sw.close();
            if (flag) map.put(splittedWords[i], file.getName()); // if the current word does not match any stop word, put it in the hashmap
        }
    } catch (FileNotFoundException e) {
        e.printStackTrace();
    } finally {
        if (s != null) s.close();
    }
    System.out.println(file);
}
This is just a block from my code; it may be missing pieces, and I explained my algorithm in a cursory way with comments. Using the .contains method to check whether the stop-word list contains the current word, even though it's a faster approach, fails to map words like "death", because "death" contains "at" from the stop-word list. I'm trying my best to make it more efficient but haven't progressed much. Each file contains approx. ~300 words and takes ~3 seconds to index, which is not ideal considering I have tens of thousands of files. Any ideas on how I can improve my algorithm to make it run faster?
Upvotes: 0
Views: 89
Reputation: 42640
There are some improvements you can make:
First, please don't use the new Scanner(File) constructor, as it uses unbuffered I/O. Small disk read operations, especially on HDDs, are very inefficient. Instead use, for example, a BufferedInputStream with a 64 KB buffer:
try (Scanner s = new Scanner(new BufferedInputStream(new FileInputStream(f), 65536))) {
// your code
}
Second: most likely your PC has a multi-core CPU, therefore you can scan multiple files in parallel. To do so, you have to ensure that you use a thread-safe map. Change the definition of the map to:
Map<String,String> map = new ConcurrentHashMap<>();
Then you can use the following code:
Files.list(dir.toPath()).parallel().forEach(f -> {
try (Scanner s = new Scanner(new BufferedInputStream(Files.newInputStream(f), 65536))) {
// your code
} catch (IOException e) {
e.printStackTrace();
}
});
Depending on the number of CPU cores in your system, it will process multiple files at the same time. Especially if you process a large number of files, this will drastically reduce the run-time of your program.
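One caveat, not part of the original code: map.put(word, file.getName()) keeps only the last file a word was seen in, whether or not you parallelize. A minimal sketch, assuming you actually want every file name per word, maps each word to a concurrent set (computeIfAbsent creates the set atomically):
Map<String, Set<String>> map = new ConcurrentHashMap<>();
...
// ConcurrentHashMap.newKeySet() returns a thread-safe Set<String>
map.computeIfAbsent(word, k -> ConcurrentHashMap.newKeySet()).add(f.getFileName().toString());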
Lastly, your implementation is rather complicated: you use the output of Scanner to create a new String that is then split again. Instead it would be better to configure Scanner to directly use the delimiters you want:
try (Scanner s = new Scanner(....).useDelimiter("[ ,!\\-.?*]+")) { // '+' merges runs of delimiters so no empty tokens appear
Then you can directly use the tokens created by Scanner and don't have to build the article String and split it later.
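For illustration, a minimal sketch of that simplified loop; the stop-word set and the map are assumptions carried over from the question, and the buffering from the first point:
Set<String> stopWords = ...; // loaded once up front, not rescanned per word
try (Scanner s = new Scanner(new BufferedInputStream(new FileInputStream(file), 65536))
        .useDelimiter("[ ,!\\-.?*]+")) {
    while (s.hasNext()) {
        String word = s.next().toLowerCase(Locale.ROOT);
        if (!stopWords.contains(word)) // exact match per token, no article String, no split
            map.put(word, file.getName());
    }
} catch (FileNotFoundException e) {
    e.printStackTrace();
}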
Upvotes: 1
Reputation: 862
What's the reason to implement a search engine yourself? For production I would recommend an existing solution, Apache Lucene, which matches your task perfectly.
If you are just practicing, there are several standard points to improve your code.
First, don't build one huge String with article += in a loop; each += copies the whole text again. It's much better to create a word regexp and to pass it into Scanner:
Pattern p = Pattern.compile("[A-Za-z]+");
try (Scanner s = new Scanner(file)) {
    while (s.hasNext()) {
        if (s.hasNext(p)) { // next token consists only of letters
            String word = s.next(p).toLowerCase(Locale.ROOT);
            ...
        } else {
            s.next(); // skip non-matching tokens so the loop doesn't stop at the first one
        }
    }
}
Second, don't rescan the stop-word file for every word: load the stop words once into a hash-based structure and test each word with the containsKey method (for a HashMap), or contains (for a HashSet).
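A minimal sketch of that idea (the file name stopwords.txt and the variable names are illustrative assumptions); because Set.contains compares whole strings with equals, "death" is no longer dropped just because it contains "at":
Set<String> stopWords = new HashSet<>();
try (Scanner sw = new Scanner(new File("stopwords.txt"))) {
    while (sw.hasNextLine())
        stopWords.add(sw.nextLine().trim().toLowerCase(Locale.ROOT)); // one stop word per line
} catch (FileNotFoundException e) {
    e.printStackTrace();
}
...
if (!stopWords.contains(word)) // O(1) lookup instead of rescanning the file
    map.put(word, file.getName());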
Upvotes: 0