Reputation: 23
I have a very long list of domains (for example: blockme.com, do.remove-me.com, etc.), millions of them. I would like to use this list to create a list of regular expressions that covers as many of these domains as possible.
I normally use a terminal command to find the most repeated character sequences:
tr -c '[:alnum:]' '[\n*]' < test.txt | sort | uniq -c | sort -nr | head -10
which gives me:
125054 com
11715 net
7823 us
7502 co
6736 2
6628 west
6495 sequoia
6464 org
5901 uk
5815 sex
The problem is that this command does not count what I expect. For example, the word sex is actually repeated 90k times in the file, but it is only counted 5815 times. And com occurs 127k times, not 125k.
I did some research and found out that this problem is called the longest repeated substring problem. I found many implementations, especially here on Stack Overflow, but none of them could replicate the output of the terminal command.
Can someone please help me write it in either Python or C++?
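The undercount happens because tr turns every maximal alphanumeric run into a single line, so sort | uniq -c only counts whole tokens: sex inside phimsex never becomes a token of its own. A toy reproduction in Python (the sample domains are just a few taken from this question):

```python
from collections import Counter
import re

domains = ["123sex.top", "1phimsex.vip", "sex.com", "24hphimsex.com"]

# Whole-token counting: equivalent to tr -c '[:alnum:]' '[\n*]' | sort | uniq -c.
tokens = []
for d in domains:
    tokens += [t for t in re.split(r"[^0-9A-Za-z]+", d) if t]
token_counts = Counter(tokens)

# Substring counting: every occurrence of "sex", embedded or not.
substring_count = sum(d.count("sex") for d in domains)

print(token_counts["sex"])  # → 1 (only the standalone token in sex.com)
print(substring_count)      # → 4
```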
Upvotes: 2
Views: 71
Reputation: 160
I wrote a brute-force answer which takes a couple of minutes but achieves this. The script below expands every token into all of its substrings of length at least minimum_split; you then run
tr -c '[:alnum:]' '[\n*]' < test.txt_out | sort | uniq -c | sort -nr | head -10
on the _out file it produces:
import sys
from tqdm import tqdm  # progress bar; this import was missing

sys.path.append('../')
from utils import *  # provides readLinesFromTextFile (project-local helper)

minimum_split = 3  # shortest substring worth counting


def return_splits(string, count):
    """Return all substrings of string with length len(string) - count."""
    cut_len = len(string) - count
    results = []
    for i in range(count + 1):
        results.append(string[i:cut_len + i])
    return results


if __name__ == '__main__':
    input_file_name = str(sys.argv[1])
    output_file_name = input_file_name + "_out"

    file_to_clean = readLinesFromTextFile(input_file_name, remove_zeros=False,
                                          remove_comments=True)

    # Tokenize each line on whitespace.
    splitted = []
    for line in file_to_clean:
        for token in line.split():
            splitted.append(token.strip())

    # Expand every token into all substrings of length >= minimum_split.
    splitted_final = []
    for line in splitted:
        if len(line) < minimum_split:
            continue
        if len(line) == minimum_split:
            splitted_final.append(line)
            continue
        for t in range(len(line) - (minimum_split - 1)):
            if t == 0:
                splitted_final.append(line)
            else:
                for sub in return_splits(line, t):
                    splitted_final.append(sub)

    with open(output_file_name, "w") as txt_file:
        for line in tqdm(splitted_final):
            txt_file.write(str(line) + "\n")
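For reference, the same idea can be sketched self-contained, without the utils helper or the second tr pass: expand each token into its substrings of length >= 3 and count them directly with collections.Counter (the sample domains are illustrative):

```python
from collections import Counter
import re

def substrings(token, min_len=3):
    """Yield every substring of token with length >= min_len."""
    for length in range(min_len, len(token) + 1):
        for start in range(len(token) - length + 1):
            yield token[start:start + length]

def count_substrings(lines, min_len=3):
    counts = Counter()
    for line in lines:
        for token in re.split(r"[^0-9A-Za-z]+", line):
            counts.update(substrings(token, min_len))
    return counts

counts = count_substrings(["123sex.top", "1phimsex.vip", "24hphimsex.com"])
print(counts["sex"])  # → 3: embedded occurrences are counted too
```

Note that, like the brute-force script, this materializes every substring, so memory and time grow quickly on millions of domains.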
Upvotes: 1
Reputation: 4949
You can design some algorithms to get "close" matches, but it won't be 100% accurate.
For instance, you can split on digits and non-alphanumeric chars: r'[0-9._-]'.
Then, you can generate a list of "common" words and test it against the domains:
from collections import Counter
import re

def get_splits(domains):
    # Split every domain on digits, dots, underscores and dashes.
    p = r'[0-9._-]'
    res = []
    for st in domains:
        res += re.split(p, st)
    return sorted(res, key=len, reverse=True)

def get_counts(splits, source):
    counts = Counter()
    for word in source:
        for i, s in enumerate(splits):
            if 0 < len(s) < 4:
                # Short fragments (mostly TLDs) are counted whole and consumed.
                counts[s] += 1
                splits[i] = ''
            if s.startswith(word):
                counts[word] += 1
                splits[i] = splits[i][len(word):]
            if s.endswith(word):
                counts[word] += 1
                splits[i] = splits[i][:-len(word)]
            # Here you can use .find(word)
            # However, there will be issues for strings, such as 'bobsecure',
            # which includes words such as 'bob', 'obsecure', 'cure', 'secure', etc.
        splits = sorted(splits, key=len, reverse=True)
    return counts
source = sorted(['yami', 'sex', 'blog', 'blogspot', 'porn', 'live', 'video', 'yellow', 'movie'], key=len, reverse=True)
domains = ['0006666.net', '0310love.com', '081xxx.com', '1001truyen.blogspot.com', '102model.com', '1116666.net', '116almet.ru', '123porn.com', '123sex.top', '12hdem.com', '1phimsex.vip', '1pondohd.com', '2089.vntopg88.live', '210vn.net', '247live.mobi', '24hchudu.com', '24hphimsex.com', '28363.one', '2jav.net', '2ola.pro', '303cc.xyz', '3jvos6.st-sy.com',
'www.v-av.com', 'www.vandr.co.jp', 'www.videomanial.co.jp', 'www.vr-products.co.jp', 'www.waap.co.jp', 'www.wanz-factory.com', 'www.woman-otona.com', 'www.yamiboard.com', 'www.yamigama.com', 'www.yellow-movie.com', 'www.yellowbox.co.jp', 'www.yellowduck.jp', 'www.zetton-av.com', 'www.zeus-web.net', 'www2.giga-jp.com', 'yamidas.yamiboard.com', 'yamigama.com']
splits = get_splits(domains)
print(get_counts(splits, source))
Counter({'com': 20, 'www': 15, 'jp': 7, 'net': 5, 'co': 5, 'yami': 5, 'yellow': 4, 'live': 4, 'sex': 3, 'blogspot': 2, 'av': 2, 'movie': 2, 'porn': 2, 'xxx': 1, 'top': 1, 'vip': 1, 'one': 1, 'jav': 1, 'ola': 1, 'pro': 1, 'xyz': 1, 'web': 1, 'ru': 1, 'vn': 1, 'cc': 1, 'st': 1, 'sy': 1, 'vr': 1, 'v': 1, 'video': 1, 'box': 1, 'das': 1})
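The .find(word) route mentioned in the code comment above can be sketched as follows (hypothetical helper name); as the comment warns, it counts every embedded occurrence, so words contained in other words, like 'cure' inside 'secure', are each counted:

```python
from collections import Counter

def count_everywhere(splits, source):
    """Count every (possibly embedded) occurrence of each source word."""
    counts = Counter()
    for word in source:
        for s in splits:
            start = s.find(word)
            while start != -1:
                counts[word] += 1
                start = s.find(word, start + 1)
    return counts

words = ['secure', 'cure', 'bob']
print(count_everywhere(['bobsecure'], words))  # each word is found once
```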
Upvotes: 2