codecreator

Reputation: 31

Calculating similarities between sentences

I have a database with thousands of rows of error logs and their descriptions. The error log is for an application that runs 24/7. I want to create a dashboard/UI for production support that shows the errors currently occurring most often.

The problem I am having is that even though there are a lot of common errors, the error descriptions differ by transaction ID, user ID, or other things that are unique to that single process.

e.g. 1: Error transaction XYz failed for user 233
e.g. 2: Error transaction XYz failed for user 567

I consider these two errors to be the same, so I want a program that will go through the new error logs and classify them into groups. I am trying to use "edit distance", but it's very slow. Since I already have old error logs, I am trying to think of solutions that use that information too. Any thoughts?

Upvotes: 3

Views: 255

Answers (2)

Mikos

Reputation: 8543

There are other metrics (apart from Levenshtein) which might be more appropriate. Have you considered Cosine Similarity?
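As a rough illustration of cosine similarity on the two messages from the question, here is a minimal Python sketch that compares bag-of-words token counts. The tokenizer and the function names (tokenize, cosine_similarity) are my own assumptions for the example and do not come from any particular library.

```python
import math
import re
from collections import Counter


def tokenize(message):
    # Hypothetical tokenizer: lowercase, then keep runs of letters and digits.
    return re.findall(r"[a-z0-9]+", message.lower())


def cosine_similarity(a, b):
    # Represent each message as a vector of token counts.
    va, vb = Counter(tokenize(a)), Counter(tokenize(b))
    # Dot product over the tokens the two messages share.
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty message is treated as similar to nothing
    return dot / (norm_a * norm_b)


e1 = "Error transaction XYz failed for user 233"
e2 = "Error transaction XYz failed for user 567"
e3 = "Connection to database timed out"  # made-up message for contrast

print(round(cosine_similarity(e1, e2), 3))  # 0.857 (6 of 7 tokens shared)
print(cosine_similarity(e1, e3))            # 0.0 (no tokens shared)
```

Messages that differ only in an ID score close to 1, so a threshold (say 0.8, to be tuned on the old logs) could decide whether two messages belong to the same group. Comparing every new message against every known group is still pairwise work, though each comparison is cheaper than an edit-distance computation.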

SimMetrics is an F/OSS library that offers an extensive collection of similarity algorithms and their corresponding cost functions.

Upvotes: 1

John Doty

Reputation: 3333

I'm assuming that the error messages are generated by a program, and so they probably fall into a very specific pattern.

That means you don't have to do anything particularly complex. Just parse the error messages: use regular expressions (or maybe something more powerful) to split the messages into tuples. Then group or count or do something with the individual fields. For example, you could do a regex like "Error transaction ([A-Z]*) failed for user ([0-9]*)". You could then make a histogram of the error codes (first capture group) or users (second capture group).
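A minimal Python sketch of that idea, assuming the message layout shown in the question. The character class is widened to [A-Za-z]+ here so that the sample code "XYz" matches; the histogram function and the sample log lines are hypothetical.

```python
import re
from collections import Counter

# Assumed message layout, taken from the examples in the question.
PATTERN = re.compile(r"Error transaction ([A-Za-z]+) failed for user ([0-9]+)")


def histogram(messages):
    # Count occurrences per error code and per user; collect anything
    # that does not fit the pattern so it can be inspected separately.
    by_code, by_user = Counter(), Counter()
    unmatched = []
    for msg in messages:
        m = PATTERN.search(msg)
        if m is None:
            unmatched.append(msg)
            continue
        code, user = m.groups()
        by_code[code] += 1
        by_user[user] += 1
    return by_code, by_user, unmatched


logs = [
    "Error transaction XYz failed for user 233",
    "Error transaction XYz failed for user 567",
    "Error transaction ABC failed for user 233",
    "Disk quota exceeded",  # made-up line that does not fit the pattern
]

by_code, by_user, unmatched = histogram(logs)
print(by_code.most_common())  # [('XYz', 2), ('ABC', 1)]
print(unmatched)              # ['Disk quota exceeded']
```

In practice there would be one pattern per known message type, and the unmatched list shows which message types still need a pattern.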

Upvotes: 1
