user799188

Reputation: 14435

Split String When Words Joined Without Delimiter

We have quite a lot of text (mostly written in English) which was incorrectly imported (from a source we have no control over). For example

  1. configuredincorrectly - into the 2 words configured & incorrectly
  2. RegardsJohn Doe - into a word Regards and a named entity John Doe
  3. To: [email protected]:[email protected]:[email protected] - into 3 tuples (To,[email protected]), (CC,[email protected]), (BCC,[email protected])
  4. problem.Possible - into the 2 words problem & possible

I acknowledge that we are trying to address multiple problems here. It is tempting to write non-scalable code such as

  1. regular expressions each time we try to solve a particular dirty text scenario,
  2. string.replace(keyword,keywordwithSpace)

Could anyone please point me towards a (partial) solution for problems 1 & 2?

A solution that makes use of natural language understanding would be ideal. We have ~1000 words in our vocabulary, such as [communication, database, hardware, network, problem, rectify, solution, etc.]. Is there a way we can "train" a model to recognize that a string like hardwarefailure really means 2 separate words, hardware & failure?

Many thanks in advance!

Upvotes: 0

Views: 895

Answers (2)

Nikita Astrakhantsev

Reputation: 4749

There is an answer to the same question that contains a link to the Python ICU library.

There is also working Python code based on a dictionary with word frequencies.
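To give an idea of how the dictionary-with-frequencies approach works, here is a minimal sketch. It is not the linked code: the word counts in `WORD_COUNTS` are made up for illustration, and the penalty for unknown substrings is an arbitrary assumption. The idea is to pick the segmentation whose words have the lowest total cost, where a word's cost is the negative log of its relative frequency, using dynamic programming over prefix lengths.

```python
import math

# Hypothetical toy counts; a real list would be built from a large corpus.
WORD_COUNTS = {
    "hardware": 50, "failure": 40, "configured": 30, "incorrectly": 20,
    "in": 100, "correctly": 25, "hard": 60, "ware": 2,
}
TOTAL = sum(WORD_COUNTS.values())
MAX_WORD_LEN = max(len(w) for w in WORD_COUNTS)


def word_cost(word):
    """Negative log probability of a word; unknown strings are heavily penalised."""
    count = WORD_COUNTS.get(word)
    if count is None:
        # Assumption: an arbitrary penalty that grows with length.
        return math.log(TOTAL) + 10.0 * len(word)
    return -math.log(count / TOTAL)


def segment(text):
    """Split text into the lowest-cost sequence of words."""
    text = text.lower()
    # best[i] = (cost of the best split of text[:i], start index of its last word)
    best = [(0.0, 0)]
    for i in range(1, len(text) + 1):
        best.append(min(
            (best[j][0] + word_cost(text[j:i]), j)
            for j in range(max(0, i - MAX_WORD_LEN), i)
        ))
    # Walk the back-pointers from the end to recover the words.
    words = []
    i = len(text)
    while i > 0:
        j = best[i][1]
        words.append(text[j:i])
        i = j
    return words[::-1]


print(segment("hardwarefailure"))
print(segment("configuredincorrectly"))
```

With these toy counts, "incorrectly" as one word is cheaper than "in" + "correctly", which shows why frequencies matter and not just dictionary membership. With a 1000-word vocabulary you would probably need counts from your own text for this to behave well.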

Also look at this question: the author has already developed a working solution. It is in Java, but it is open source and documented.

Upvotes: 1

Denis Tarasov

Reputation: 1051

Some languages, such as Chinese, don't have spaces (or other delimiters) between words. Therefore I think approaches developed for segmenting such languages can be useful here (see this paper for an example system description, and also this one).

The basic idea is that we train some classifier to classify characters:

"Each Chinese character can be assigned one of four possible boundary tags: S for a character that occurs as a single-character word, B for a character that begins a multi-character word, E for a character that ends a multi-character word, and M for a character that is neither first nor last"
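Applied to English letters instead of Chinese characters, the tagging scheme quoted above could look like the sketch below. The function name `boundary_tags` is my own and does not come from the paper.

```python
def boundary_tags(words):
    """Return one S/B/M/E tag per character for a list of words."""
    tags = []
    for word in words:
        if not word:
            continue  # ignore empty strings
        if len(word) == 1:
            tags.append("S")  # single-character word
        else:
            # first character, any middle characters, last character
            tags.append("B" + "M" * (len(word) - 2) + "E")
    return "".join(tags)


print(boundary_tags(["hardware", "failure"]))  # BMMMMMMEBMMMMME
print(boundary_tags(["a", "big", "ox"]))       # SBMEBE
```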

The classifier can be a maximum entropy model, a conditional random field, a recurrent neural network, or something else. Code that implements these is readily available as standalone programs and (for a number of classifiers) as Python libraries/bindings. A Google search should turn up a lot of them.

Thus we can take a lot of corrupted text (which is easy to generate by removing the spaces from clean text) and assign a tag to each letter (this can be done automatically, because we generate the corrupted text from the original form and so know where the words begin and end). That gives us a training set as large as we want. For each character in a string we need to generate a feature vector (usually including information about previous characters, but we can also add dictionary-based features). At run time we first tag a string such as "hardwarefailure" and then split it before every character tagged "B" (or "S", if single-letter words occur).
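A rough sketch of the two pieces around the classifier, the per-character features and the final split, is given below. The classifier itself is left out. The feature names, the window size, and the helper names `char_features` and `split_by_tags` are assumptions chosen for illustration; a real system would tune the features for whichever library it uses.

```python
def char_features(text, i, vocabulary, window=2):
    """Build a feature dict for text[i]: nearby characters plus dictionary hints."""
    features = {"char": text[i]}
    for offset in range(1, window + 1):
        features[f"prev{offset}"] = text[i - offset] if i - offset >= 0 else "<s>"
        features[f"next{offset}"] = (
            text[i + offset] if i + offset < len(text) else "</s>"
        )
    # Dictionary-based features: does a known word start or end at position i?
    features["word_starts_here"] = any(text.startswith(w, i) for w in vocabulary)
    features["word_ends_here"] = any(text.endswith(w, 0, i + 1) for w in vocabulary)
    return features


def split_by_tags(text, tags):
    """Start a new word before every character tagged B or S."""
    if len(text) != len(tags):
        raise ValueError("text and tags must have the same length")
    words = []
    for char, tag in zip(text, tags):
        if tag in ("B", "S") or not words:
            words.append(char)
        else:
            words[-1] += char
    return words


vocabulary = {"hardware", "failure"}  # stand-in for the ~1000-word vocabulary
print(char_features("hardwarefailure", 8, vocabulary))
# The tags here are written by hand; at run time they would come from the classifier.
print(split_by_tags("hardwarefailure", "BMMMMMMEBMMMMME"))
```

The feature dicts are in the shape that CRF-style taggers commonly accept, but check the input format of the library you pick.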

One note of caution: developing any machine-learning solution can be time-consuming, and it can sometimes fail to work at all, especially if you have never done it before.

Upvotes: 2
