bernie2436

Reputation: 23901

AI to learn patterns in invalid data?

I work at a public health department that takes in and stores lots of medical data every day. I've written a program that uses regular expressions to determine whether particular fields in the incoming data are valid or invalid. For example, DOBs come in as YYYYMMDD, so they should match the regex ^[0-9]{8}$

I want to analyze the "invalid" data to help identify problems in our system (we get way too much data to go through each 'bad' record row-by-row). Can anyone suggest AI techniques/machine learning techniques that can 'monitor' the bad data and find patterns in what is wrong? I think that coming up with a bunch of regular expressions for possible ways the data could be invalid (ex. not enough or too many characters) and then keeping track of those results might work. But instead of me thinking up all of the ways the data could be invalid, I'm curious about ways to 'learn' the patterns from the bad data using AI.

Are there any known techniques that do this?

Upvotes: 4

Views: 305

Answers (3)

Shaggy Frog

Reputation: 27601

"I think that coming up with a bunch of regular expressions for possible ways the data could be invalid (ex. not enough or too many characters) and then keeping track of those results might work. But instead of me thinking up all of the ways the data could be invalid, I'm curious about ways to 'learn' the patterns from the bad data using AI."

What's funny is I'm reminded of a quotation usually attributed to Jamie Zawinski:

Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.

Except, in this case, I think the hand-crafted regex route is actually your best bet!

Irony of ironies.

Anyway.

The point of this saying is that people tend to overcomplicate their solutions. Here, regexes are actually a fairly simple solution to your problem, whereas creating a learner will take you a lot more time than I think you realize.

For a data representation this constrained (a date), there are far fewer ways to express it correctly than incorrectly; in fact, there are infinitely many ways for the data to be bad. You want to train a learner to detect all of them? It's a rabbit hole. Think of this AI learner instead as a coworker or a friend: how would you describe to them all the ways that a date can be represented improperly?

Your intention was to make less work for yourself in the long run, and that's a good quality to have. But the cost of figuring out how to develop a learner, not to mention training and validating it, not to mention watching it carefully, outweighs any benefit that learner can provide in such a narrow use case.
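
To make the hand-crafted route concrete, here is a minimal sketch in Python of what "a bunch of regexes plus keeping track of the results" might look like. The category names and patterns are made up for illustration (they assume a YYYYMMDD date-of-birth field), so treat them as placeholders for whatever failure modes you actually see.

```python
import re
from collections import Counter

# Hypothetical failure categories for a YYYYMMDD date-of-birth field.
# Order matters: the first pattern that matches decides the category.
FAILURE_PATTERNS = [
    ("empty", re.compile(r"^\s*$")),
    ("has_separators", re.compile(r"^\d{1,4}[-/.]\d{1,2}[-/.]\d{1,4}$")),
    ("too_few_digits", re.compile(r"^\d{1,7}$")),
    ("too_many_digits", re.compile(r"^\d{9,}$")),
    ("contains_letters", re.compile(r"[A-Za-z]")),
]


def classify_failure(value):
    """Return the first matching category name, or 'other' if none match."""
    for label, pattern in FAILURE_PATTERNS:
        if pattern.search(value):
            return label
    return "other"


def tally_failures(bad_values):
    """Count how many invalid values fall into each category."""
    return Counter(classify_failure(v) for v in bad_values)
```

Calling `tally_failures` on a day's worth of rejected values gives you a count per category. A growing "other" bucket is the signal that it is time to look at a few raw records and add another pattern.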

Upvotes: 3

phs

Reputation: 11051

It sounds like you want to apply supervised learning to regular expressions. These fellows seem to be up to something of that sort.
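
Short of real regex learning, a much simpler stand-in (not the approach referred to above, just an illustrative sketch) is to collapse each bad value into a coarse "shape" and count the shapes. The function names below are hypothetical.

```python
from collections import Counter


def shape(value):
    """Map digits to '9', letters to 'A', whitespace to '_'; keep other characters."""
    out = []
    for ch in value:
        if ch.isdigit():
            out.append("9")
        elif ch.isalpha():
            out.append("A")
        elif ch.isspace():
            out.append("_")
        else:
            out.append(ch)
    return "".join(out)


def common_shapes(bad_values, top=5):
    """Return the most frequent shapes among the invalid values."""
    return Counter(shape(v) for v in bad_values).most_common(top)
```

The frequent shapes are effectively candidate patterns: a shape like `9999-99-99` showing up thousands of times suggests one upstream system is sending dates with separators.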

Upvotes: 1

Stu

Reputation: 15769

Bayesian filtering might be what you are looking for.
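
As one possible reading of that suggestion, here is a rough sketch of a naive Bayes classifier in Python that learns from a handful of bad values you have labeled by hand and then guesses the category of new ones. The features, class name, and labels are all assumptions made for illustration; a real setup would need better features and far more training data.

```python
import math
from collections import Counter, defaultdict


def features(value):
    """Coarse, hypothetical features: length plus which character classes appear."""
    feats = ["len=%d" % len(value)]
    if any(c.isdigit() for c in value):
        feats.append("has_digit")
    if any(c.isalpha() for c in value):
        feats.append("has_alpha")
    if any(not c.isalnum() for c in value):
        feats.append("has_symbol")
    return feats


class NaiveBayes:
    def __init__(self):
        self.label_counts = Counter()
        self.feature_counts = defaultdict(Counter)
        self.vocab = set()

    def train(self, value, label):
        """Record one hand-labeled bad value."""
        self.label_counts[label] += 1
        for f in features(value):
            self.feature_counts[label][f] += 1
            self.vocab.add(f)

    def predict(self, value):
        """Return the label with the highest log-probability (add-one smoothing)."""
        total = sum(self.label_counts.values())
        best_label, best_score = None, float("-inf")
        for label, count in self.label_counts.items():
            score = math.log(count / total)
            denom = sum(self.feature_counts[label].values()) + len(self.vocab)
            for f in features(value):
                score += math.log((self.feature_counts[label][f] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

You would call `train` once per labeled example (say, a few dozen values tagged "separators", "truncated", or "text") and then `predict` on the rest of the invalid values to get a rough breakdown by cause.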

Upvotes: 2
