user2064000

Generating a regex from a hosts file using bash/sed/awk

I have a hosts file which is in the following format:

# comments

(ipv4/ipv6 address) (multiple hostnames)
.
.
.

I need to convert them to an optimised regular expression using bash/sed/awk. For example, if we have the following in the hosts file:

127.0.0.1 abc.example.com def.example.com
127.0.0.1 ghi-example.com foobar.com
127.0.0.1 malwaredomain.com malware-domain.com

to be converted as:

(((abc|def)\.|ghi-)example\.com|foobar\.com|malware-?domain\.com)

It may be preferable to also have some intelligent conversion. For example, if we have lots of similar entries like:

127.0.0.1 ad-us.adserver.com ad-uk.adserver.com ad-fr.adserver.com ad-de.adserver.com
127.0.0.1 ad-ru.adserver.com ad-ca.adserver.com ad-se.adserver.com ad-be.adserver.com
...

They could be converted to ad-.*\.adserver\.com, or maybe even ad-.{2}\.adserver\.com. Of course, something like ad-(us|uk|fr|de|ru|ca|se|be)\.adserver\.com works, but I'd prefer a generic rule, since it has the additional benefit of detecting servers that may be added later.

EDIT: Summarising, if I have a hosts file like this:

127.0.0.1 atmdt.com foo.atmdt.com bar.atmdt.com
127.0.0.1 anifkalood.ru boeing-job.com ilianorkin.ru humaniopa.ru
127.0.0.1 hillairusbomges.ru mgithessia.biz justintvfreefall.org

The output should be a regex that covers all the servers above:

((((foo|bar)\.)?atmdt|boeing-job)\.com|(anifkalood|hillairusbomges|ilianorkin|humaniopa)\.ru|mgithessia\.biz|justintvfreefall\.org)

How can I achieve this?

Thanks in advance.

Upvotes: 0

Views: 525

Answers (2)

Édouard Lopez

Reputation: 43401

You seem to be looking for a regex generator. Here are some:

I would recommend the genetic approach, but I'm not sure what level of optimization these tools offer.

Upvotes: 2

Miklos Aubert

Reputation: 4575

This looks more like a Computer Science project than a simple programming question!

I don't think you'll find any straightforward bash/sed/awk instructions to do this. You want to create regular expressions programmatically, whereas sed and awk are better suited to applying regexes than to generating them. I guess you'd have to look into approximate string matching and, specifically, computing the Levenshtein distance between two strings.
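That said, an unoptimised baseline is within reach of awk. The following is only a minimal sketch, not the optimised generator the question asks for: it skips the address field, strips comments, escapes the dots, and groups hostnames by their last label (the TLD), so the result looks like (a|b)\.com|c\.ru. It does no prefix merging and no fuzzy generalisation. The function name hosts_to_regex is made up for illustration, and the sketch assumes hostnames contain only letters, digits, hyphens and dots.

```shell
# Hypothetical helper: read a hosts file on stdin, print one alternation regex.
hosts_to_regex() {
    awk '
    {
        sub(/#.*/, "")                    # drop comments (whole-line or trailing)
        for (i = 2; i <= NF; i++) {       # field 1 is the IPv4/IPv6 address
            host = $i
            if (seen[host]++) continue    # ignore duplicate hostnames
            n = split(host, parts, ".")   # a single-character separator is literal
            if (n < 2) continue           # assumption: skip names without a dot
            tld = parts[n]
            # Rebuild everything before the TLD with the dots escaped.
            prefix = parts[1]
            for (j = 2; j < n; j++) prefix = prefix "\\." parts[j]
            if (tld in group) {
                group[tld] = group[tld] "|" prefix
            } else {
                order[++ntld] = tld       # remember first-seen order of TLDs
                group[tld] = prefix
            }
            count[tld]++
        }
    }
    END {
        out = ""
        for (k = 1; k <= ntld; k++) {
            tld = order[k]
            # Parenthesise only when a TLD has more than one hostname.
            alt = (count[tld] > 1) ? "(" group[tld] ")" : group[tld]
            out = out (k > 1 ? "|" : "") alt "\\." tld
        }
        if (ntld) print "(" out ")"
    }'
}
```

For example, piping the two lines "127.0.0.1 foo.atmdt.com atmdt.com" and "127.0.0.1 anifkalood.ru" into hosts_to_regex prints ((foo\.atmdt|atmdt)\.com|anifkalood\.ru), which can be passed to grep -E. Collapsing common prefixes or suffixes any further would need something like a trie or the distance-based comparison mentioned above.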

Upvotes: 0
