Reputation:
I have a hosts file which is in the following format:
# comments
(ipv4/ipv6 address) (multiple hostnames)
.
.
.
I need to convert the hostnames in it to a single optimised regular expression using bash/sed/awk. For example, if we have the following in the hosts file:
127.0.0.1 abc.example.com def.example.com
127.0.0.1 ghi-example.com foobar.com
127.0.0.1 malwaredomain.com malware-domain.com
to be converted as:
(((abc|def)\.|ghi-)example\.com|foobar\.com|malware-?domain\.com)
It may be preferable to also have some intelligent conversion. For example, if we have lots of similar entries like:
127.0.0.1 ad-us.adserver.com ad-uk.adserver.com ad-fr.adserver.com ad-de.adserver.com
127.0.0.1 ad-ru.adserver.com ad-ca.adserver.com ad-se.adserver.com ad-be.adserver.com
...
They may be converted as ad-.*\.adserver\.com, maybe even as ad-.{2}\.adserver\.com. Of course something like ad-(us|uk|fr|de|ru|ca|se|be)\.adserver\.com works, but I'd prefer to have a generic rule, since there's the additional benefit of detecting servers that may be added later.
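As a rough illustration of the generic rule described above, here is a minimal Python sketch (not bash/sed/awk). The function name generalize is hypothetical. It assumes the hostnames differ only in one middle part: it keeps the common prefix and suffix and replaces the middle with .{n} when every variant has the same length n, falling back to .* otherwise.

```python
import os
import re

def generalize(hostnames):
    """Collapse similar hostnames into prefix + wildcard + suffix (hypothetical helper)."""
    # os.path.commonprefix compares character by character, so it works on any strings.
    prefix = os.path.commonprefix(hostnames)
    # Common suffix = common prefix of the reversed strings, reversed back.
    suffix = os.path.commonprefix([h[::-1] for h in hostnames])[::-1]

    # Make sure prefix and suffix do not overlap inside the shortest name.
    shortest = min(len(h) for h in hostnames)
    excess = len(prefix) + len(suffix) - shortest
    if excess > 0:
        suffix = suffix[excess:]

    # The variable part of each hostname, between prefix and suffix.
    middles = [h[len(prefix):len(h) - len(suffix)] for h in hostnames]
    lengths = {len(m) for m in middles}

    if lengths == {0}:
        wildcard = ""                          # all names are identical
    elif len(lengths) == 1:
        wildcard = ".{" + str(lengths.pop()) + "}"  # same length everywhere
    else:
        wildcard = ".*"                        # lengths differ, be permissive
    return re.escape(prefix) + wildcard + re.escape(suffix)

hosts = [
    "ad-us.adserver.com", "ad-uk.adserver.com", "ad-fr.adserver.com",
    "ad-de.adserver.com", "ad-ru.adserver.com", "ad-ca.adserver.com",
    "ad-se.adserver.com", "ad-be.adserver.com",
]
pattern = generalize(hosts)
print(pattern)
```

The resulting pattern also matches names that are not in the list (for example ad-jp.adserver.com), which is the intended trade-off here; whether that is acceptable depends on how aggressive the blocking should be.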
EDIT: Summarising, if I have a hosts file like this:
127.0.0.1 atmdt.com foo.atmdt.com bar.atmdt.com
127.0.0.1 anifkalood.ru boeing-job.com ilianorkin.ru humaniopa.ru
127.0.0.1 hillairusbomges.ru mgithessia.biz justintvfreefall.org
The output will be a regex which covers all the servers above:
((((foo|bar)\.)?atmdt|boeing-job)\.com|(anifkalood|hillairusbomges|ilianorkin|humaniopa)\.ru|mgithessia\.biz|justintvfreefall\.org)
How can I achieve this?
Thanks in advance.
Upvotes: 0
Views: 525
Reputation: 43401
You seem to be looking for a regex generator. Here are some:
I would recommend the genetic approach, but I'm not sure what level of optimization these generators achieve.
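For a sense of what a very simple generator could look like, here is a minimal sketch in Python rather than bash/sed/awk. It is not one of the generators mentioned above and it is not a genetic approach: it only builds a trie of the hostname labels, read from right to left, so that shared suffixes such as .com are written once. The names parse_hosts, build_trie and hosts_to_regex are hypothetical, and the sketch assumes a hosts file with the address in the first column and # comments.

```python
import re

END = "."  # end-of-name marker; can never be a label because names are split on "."

def parse_hosts(text):
    """Collect hostnames from hosts-file text, skipping comments and the address column."""
    names = set()
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if line:
            names.update(line.split()[1:])
    return sorted(names)

def build_trie(hostnames):
    """Nested dicts keyed on labels from right to left, so shared suffixes merge."""
    root = {}
    for name in hostnames:
        node = root
        for label in reversed(name.split(".")):
            node = node.setdefault(label, {})
        node[END] = {}
    return root

def prefix_pattern(node):
    """Pattern for everything to the left of this node's label, including the dot."""
    labels = sorted(k for k in node if k != END)
    if not labels:
        return ""
    alts = [prefix_pattern(node[l]) + re.escape(l) + r"\." for l in labels]
    if END in node:
        # A hostname may also stop here, so the whole prefix is optional.
        return "(?:" + "|".join(alts) + ")?"
    return alts[0] if len(alts) == 1 else "(?:" + "|".join(alts) + ")"

def hosts_to_regex(text):
    """Build one regex matching exactly the hostnames listed in the hosts text."""
    root = build_trie(parse_hosts(text))
    tops = [prefix_pattern(root[l]) + re.escape(l) for l in sorted(root)]
    return "(?:" + "|".join(tops) + ")"

SAMPLE = """# comments
127.0.0.1 atmdt.com foo.atmdt.com bar.atmdt.com
127.0.0.1 anifkalood.ru boeing-job.com ilianorkin.ru humaniopa.ru
127.0.0.1 hillairusbomges.ru mgithessia.biz justintvfreefall.org
"""
regex = hosts_to_regex(SAMPLE)
print(regex)
```

This merges whole labels only, so it would not produce character-level simplifications like malware-?domain or generic rules like ad-.{2}; those would need a further pass.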
Upvotes: 2
Reputation: 4575
This looks more like a Computer Science project than a simple programming question!
I don't think you'll find any straightforward bash/sed/awk instructions to do this. You want to create regular expressions programmatically, and sed/awk are typically better suited to using regexes than to generating them. I guess you'd have to look into approximate string matching and, specifically, computing the Levenshtein distance between two strings.
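To make the Levenshtein idea concrete, here is a minimal sketch in Python (an assumption on my part, since the question asks for bash/sed/awk) of the standard dynamic-programming edit distance. The function name levenshtein is just illustrative.

```python
def levenshtein(a, b):
    """Edit distance between a and b; insert, delete and substitute each cost 1."""
    # prev[j] holds the distance between the first i-1 chars of a and the first j chars of b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning the first i chars of a into "" takes i deletions
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or free if equal
            ))
        prev = curr
    return prev[-1]

print(levenshtein("malwaredomain.com", "malware-domain.com"))
print(levenshtein("ad-us.adserver.com", "ad-uk.adserver.com"))
```

Pairs of hostnames with a small distance would be the candidates for merging into one pattern, such as malware-?domain\.com; deciding how to turn such a pair into a regex is a separate step that this sketch does not cover.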
Upvotes: 0