user471011

Reputation: 7374

Java: searching for words in input text on the server. Ideas for implementation

For example, suppose I have this situation:

The server has a list of words:

{'word1', 'word2', 'word3', 'word4'}

The user sends a request to the server with some text:

"some text here word1. many many other text word4"

The server must process this input text, find every word from the server's list that occurs in it, mark those words, and send the resulting text back to the user:

"some text here <mark>word1</mark>. many many other text <mark>word4</mark>"

That is the main idea. I need to implement this logic now, so I am asking for help.

I need to decide which technologies and tools to use. What tools would you recommend for this task?

Upvotes: 2

Views: 248

Answers (3)

AlexR

Reputation: 115368

Here is the naive solution:

// Naive: each word is treated as a regex and is also matched inside longer words.
for (String word : words) {
    text = text.replaceAll(word, "<mark>" + word + "</mark>");
}

A better solution should use a regular expression with word boundaries, to avoid marking word fragments such as wo<mark>man</mark>. Build a regex like "\\b" + word + "\\b".
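
A minimal sketch of the word-boundary variant, assuming the keywords are plain literals rather than regexes. The class name WordBoundaryMarker and the method mark are made up for illustration; Pattern.quote and Matcher.quoteReplacement are added so that keywords containing regex or replacement metacharacters are taken literally.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original answer.
final class WordBoundaryMarker {

    // Wraps every whole-word occurrence of each keyword in <mark> tags.
    static String mark(String text, List<String> words) {
        for (String word : words) {
            // Pattern.quote makes the keyword literal; \b anchors it at word boundaries.
            String regex = "\\b" + Pattern.quote(word) + "\\b";
            // quoteReplacement protects '$' and '\' in the keyword.
            String replacement = Matcher.quoteReplacement("<mark>" + word + "</mark>");
            text = text.replaceAll(regex, replacement);
        }
        return text;
    }

    public static void main(String[] args) {
        System.out.println(mark("a man and a woman", Arrays.asList("man")));
    }
}
```

Because this still replaces in a loop, a later keyword can match text inserted by an earlier pass (a keyword "mark" would match inside the tags), so it is probably only suitable when the keyword list cannot collide with the markup.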

But I'd suggest checking out ready-to-use engines like Solr (or Lucene).

Upvotes: 2

Samuel Edwin Ward

Reputation: 6675

The simplest way to accomplish this would be to use String.replaceAll. You can combine all of the key words into one regular expression and use a back-reference to include the original word. If the keywords contain regular expression operators you will have to escape those.
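
As a rough sketch of that idea, the keywords can be joined into one alternation, escaped with Pattern.quote, and replaced with the back-reference $1. The class name SinglePassMarker is hypothetical, the \b word boundaries are an added assumption about what counts as a word, and the keyword list is assumed to be non-empty.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper illustrating one combined regex with a back-reference.
final class SinglePassMarker {

    private final Pattern pattern;

    // Assumes a non-empty keyword list; an empty list would match every word boundary.
    SinglePassMarker(List<String> words) {
        // Pattern.quote escapes any regex operators inside the keywords.
        String alternation = words.stream()
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
        // The pattern is compiled once and reused for every request.
        this.pattern = Pattern.compile("\\b(" + alternation + ")\\b");
    }

    // $1 is the back-reference to whichever keyword matched.
    String mark(String text) {
        return pattern.matcher(text).replaceAll("<mark>$1</mark>");
    }

    public static void main(String[] args) {
        SinglePassMarker marker =
                new SinglePassMarker(Arrays.asList("word1", "word2", "word3", "word4"));
        System.out.println(marker.mark("some text here word1. many many other text word4"));
    }
}
```

Since the input is scanned once, text produced by one replacement is never examined again.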

It's usually a mistake to call String.replaceAll in a loop because the intermediate results could contain a match that wasn't in the input. As a contrived example, suppose I wanted to replace "ab" with "b" and "bb" with "c". So, the correct output for "bab" would be "bb". However, "bab".replaceAll("ab", "b").replaceAll("bb", "c") is "c". For the same reason, you wouldn't want to use String.replace in a loop although that seems like the easiest way to accomplish the task at hand.
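
The contrived example can be reproduced with a small sketch like the one below. The class and method names are invented for illustration; singlePass uses Matcher.appendReplacement so that both rules are applied to the original input only.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo of chained replaceAll versus a single pass.
final class ChainedReplaceDemo {

    // Second call sees the output of the first, so it can match text that was not in the input.
    static String chained(String input) {
        return input.replaceAll("ab", "b").replaceAll("bb", "c");
    }

    // Applies both rules in one scan over the original input.
    static String singlePass(String input) {
        Map<String, String> rules = new HashMap<>();
        rules.put("ab", "b");
        rules.put("bb", "c");
        Matcher m = Pattern.compile("ab|bb").matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Look up the replacement for whichever alternative matched.
            m.appendReplacement(out, Matcher.quoteReplacement(rules.get(m.group())));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(chained("bab"));    // prints "c"
        System.out.println(singlePass("bab")); // prints "bb"
    }
}
```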

If you need more performance than this provides, the first step would be to compile the regular expressions in advance. If you need a lot more, there are some really interesting research papers on string search.

Upvotes: 1

b.buchhold

Reputation: 3906

There are many open questions, such as what exactly delimits "words". E.g. do you wish to highlight "full" in "full-text"?

However, here's a really simple idea:

  1. Collect the server's words in a HashSet.
  2. Parse each request, i.e. identify words according to whatever delimiters you choose (linear time).
  3. For each token / word, check membership in the HashSet (O(1)).
  4. Write the word, or the word wrapped in your mark tags, to the output.
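
The steps above could be sketched roughly as follows. The class name TokenMarker is hypothetical, and the sketch assumes that a word is any run of letters or digits, so everything else (spaces, punctuation, hyphens) is a delimiter that is copied through unchanged. Under that assumption "full" in "full-text" does get highlighted; a different delimiter rule would change that.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper: tokenizes the text and looks each token up in a HashSet.
final class TokenMarker {

    static String mark(String text, Set<String> words) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < text.length()) {
            int start = i;
            if (Character.isLetterOrDigit(text.charAt(i))) {
                // Consume one token (assumed definition: a run of letters/digits).
                while (i < text.length() && Character.isLetterOrDigit(text.charAt(i))) {
                    i++;
                }
                String token = text.substring(start, i);
                if (words.contains(token)) { // O(1) expected lookup
                    out.append("<mark>").append(token).append("</mark>");
                } else {
                    out.append(token);
                }
            } else {
                // Consume a run of delimiters and copy it through unchanged.
                while (i < text.length() && !Character.isLetterOrDigit(text.charAt(i))) {
                    i++;
                }
                out.append(text, start, i);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Set<String> words = new HashSet<>(Arrays.asList("word1", "word2", "word3", "word4"));
        System.out.println(mark("some text here word1. many many other text word4", words));
    }
}
```

The whole request is processed in one linear scan, and the lookup is case-sensitive as written.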

By the way: Lucene, Solr, etc. won't help much here. Of course you can use them, but it just doesn't make sense. Their strength is building an index over text, and text can mean HUGE amounts of data. A set of words is bounded by the dictionary of the language, which is usually tiny by computer standards. A simple HashSet should suffice for your needs.

Upvotes: 2
