LinuxNovice

Reputation: 13

Elasticsearch - Split string into common words without white space or special characters

I have been trying to figure out how to split a string into words using Elasticsearch. I have tried the word_delimiter token filter, but it only seems to work if the string already contains delimiters, for example "this-is-a-string".
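For context, the word_delimiter behaviour described above comes from a custom analyzer along these lines (a minimal sketch; the analyzer name `delimiter_analyzer` is illustrative, not from the original post):

```json
{
  "settings": {
    "analysis": {
      "analyzer": {
        "delimiter_analyzer": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": ["lowercase", "word_delimiter"]
        }
      }
    }
  }
}
```

Against "this-is-a-string" this produces the tokens `this`, `is`, `a`, `string`, but against "redcar" it emits the single token `redcar`, since there is no delimiter character to split on.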

However, my goal is to split strings into words like these examples:

"redcar" => "Red Car"
"greatholiday" => "Great Holiday"
"myhouseisred" => "My house is red"

What would be the best option? Would I use a custom tokenizer?

Any help would be a huge relief. Thanks!

--- Use Case ---

@Elasticsearch Ninja

I have a database of documents, and one of the columns contains strings specific to each document. Some of those strings contain English words but are not correctly formatted. (There is no way for me to get a copy of already-formatted data, because the current format is the only way I can receive the data.)

For example, I have the following columns:

id  |  text            |  document_id
 1  |  redcar          |  10844
 2  |  cheaphouses     |  22418
 3  |  notarealstring  |  9821
...

I want to use Elasticsearch (or maybe some other solution) to parse each "text" field and separate the string based on common English words, so that the current documents would become:

Upvotes: 1

Views: 1881

Answers (1)

Amit

Reputation: 32386

What you are trying to achieve is not possible using any tokenizer or custom analyzer in Elasticsearch, because you don't have a fixed pattern by which to divide your text and create tokens.

And as mentioned earlier in the comments, if you try to do this yourself it will be inefficient and mostly wrong, and it will be really difficult to cover all the use cases you might have.

In short, ES doesn't provide an out-of-the-box solution; you would have to build these tokens in your application, and it will not be efficient or performant.
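To illustrate what "build these tokens in your application" could look like, here is a minimal sketch of dictionary-based segmentation using the classic word-break dynamic program. It assumes you can supply a dictionary of common English words; the tiny `WORDS` set below exists only for illustration, and real results depend entirely on the dictionary you use.

```python
# Words the segmenter is allowed to use; a real application would load
# a full English word list instead of this illustrative set.
WORDS = {"red", "car", "great", "holiday", "my", "house", "is", "cheap", "houses"}

def segment(text, words=WORDS):
    """Split `text` into dictionary words, or return None if no split exists."""
    n = len(text)
    # best[i] holds a list of words covering text[:i], or None if uncoverable.
    best = [None] * (n + 1)
    best[0] = []
    for i in range(1, n + 1):
        for j in range(i):
            if best[j] is not None and text[j:i] in words:
                best[i] = best[j] + [text[j:i]]
                break  # take the first valid split for position i
    return best[n]

print(segment("redcar"))          # ['red', 'car']
print(segment("myhouseisred"))    # ['my', 'house', 'is', 'red']
print(segment("notarealstring"))  # None (not fully covered by WORDS)
```

Note that strings like "notarealstring" from the table above would stay unsplit with this dictionary, which is consistent with the answer's point: covering every case correctly is the hard part, not the splitting itself.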

Upvotes: 1

Related Questions