glimmer-of-light

Reputation: 23

Find the most repeated sequence of chars in a list of domains

I have a very long list of domains (for example: blockme.com, do.remove-me.com, etc.), millions of them. I would like to use this list to create a list of regular expressions that covers as many of these domains as possible.

I normally use a terminal command to find the most repeated character sequences:

tr -c '[:alnum:]' '[\n*]' < test.txt | sort | uniq -c | sort -nr | head -10

which gives me:

 125054 com
  11715 net
   7823 us
   7502 co
   6736 2
   6628 west
   6495 sequoia
   6464 org
   5901 uk
   5815 sex

The problem is that this command does not count what I actually want: tr replaces every non-alphanumeric character with a newline, so only whole tokens between separators are counted, and sex inside a token like phimsex is tallied as phimsex, not as sex. For example, the word sex is actually repeated 90k times in the file, but it is only detected 5815 times, and com appears 127k times, not 125k.
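For reference, this is how the raw counts can be double-checked in Python (a minimal sketch, assuming the list is in test.txt, one domain per line; str.count counts non-overlapping occurrences, which is fine for short needles like these):

from pathlib import Path

# Raw substring counts, ignoring token boundaries entirely.
data = Path('test.txt').read_text()
for needle in ('com', 'sex'):
    print(needle, data.count(needle))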

I did some research and found out that this is known as the longest repeated substring problem. I found many implementations, especially here on Stack Overflow, but none of them could replicate the output of the terminal command.

Can someone please help me write this in either Python or C++?

Upvotes: 2

Views: 71

Answers (2)

ellat

Reputation: 160

I wrote a brute-force solution which takes a couple of minutes to run but produces the counts you want.

You will then have to run tr -c '[:alnum:]' '[\n*]' < test.txt_out | sort | uniq -c | sort -nr | head -10 on the resulting _out file:

import sys
from tqdm import tqdm

minimum_split = 3  # shortest substring length to emit

def return_splits(string, count):
    # Return every substring of length len(string) - count.
    cut_len = len(string) - count
    results = []
    for i in range(count + 1):
        results.append(string[i:cut_len + i])
    return results

if __name__ == '__main__':
    input_file_name = sys.argv[1]
    output_file_name = input_file_name + "_out"

    # Read the input, skipping blank lines and comment lines
    # (inline replacement for the original readLinesFromTextFile helper).
    with open(input_file_name) as f:
        file_to_clean = [line for line in f
                         if line.strip() and not line.lstrip().startswith('#')]

    # Break each line into whitespace-separated tokens.
    splitted = []
    for line in file_to_clean:
        for token in line.split():
            splitted.append(token.strip())

    # For every token, emit the token itself plus all of its
    # substrings down to length minimum_split.
    splitted_final = []
    for line in splitted:
        if len(line) < minimum_split:
            continue

        if len(line) == minimum_split:
            splitted_final.append(line)
            continue

        for t in range(len(line) - (minimum_split - 1)):
            if t == 0:
                splitted_final.append(line)
            else:
                for sub in return_splits(line, t):
                    splitted_final.append(sub)

    # One substring per line, so the tr | sort | uniq -c pipeline can count them.
    with open(output_file_name, "w") as txt_file:
        for line in tqdm(splitted_final):
            txt_file.write(line + "\n")
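If you save the script as, say, splitter.py (the name is my choice; it takes the input file as its only argument), the whole run is python3 splitter.py test.txt, which writes test.txt_out for the pipeline above to count.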

Upvotes: 1

user24714692

Reputation: 4949

You can design heuristics to get "close" matches, but they won't be 100% accurate.

  • For instance, you can split on digits and separator chars: r'[0-9._-]'.

  • Then, you can generate a list of "common" words and test them against the domains:

Example:

from collections import Counter
import re


def get_splits(domains):
    # Split every domain on digits and separator characters,
    # longest fragments first.
    p = r'[0-9._-]'
    res = []
    for st in domains:
        res += re.split(p, st)
    return sorted(res, key=len, reverse=True)


def get_counts(splits, source):
    counts = Counter()
    for word in source:
        for i, s in enumerate(splits):
            # Very short fragments (1-3 chars) are counted verbatim.
            if 0 < len(s) < 4:
                counts[s] += 1
                splits[i] = ''
            # Fragments that start or end with a known word are counted
            # as that word, and the matched part is trimmed away.
            if s.startswith(word):
                counts[word] += 1
                splits[i] = splits[i][len(word):]
            if s.endswith(word):
                counts[word] += 1
                splits[i] = splits[i][:-len(word)]
            # Here you can use .find(word)
            # However, there will be issues for strings, such as 'bobsecure',
            # which includes words such as 'bob', 'obsecure', 'cure', 'secure', etc.
        splits = sorted(splits, key=len, reverse=True)
    return counts


source = sorted(['yami', 'sex', 'blog', 'blogspot', 'porn', 'live', 'video', 'yellow', 'movie'], key=len, reverse=True)
domains = ['0006666.net', '0310love.com', '081xxx.com', '1001truyen.blogspot.com', '102model.com', '1116666.net', '116almet.ru', '123porn.com', '123sex.top', '12hdem.com', '1phimsex.vip', '1pondohd.com', '2089.vntopg88.live', '210vn.net', '247live.mobi', '24hchudu.com', '24hphimsex.com', '28363.one', '2jav.net', '2ola.pro', '303cc.xyz', '3jvos6.st-sy.com',
           'www.v-av.com', 'www.vandr.co.jp', 'www.videomanial.co.jp', 'www.vr-products.co.jp', 'www.waap.co.jp', 'www.wanz-factory.com', 'www.woman-otona.com', 'www.yamiboard.com', 'www.yamigama.com', 'www.yellow-movie.com', 'www.yellowbox.co.jp', 'www.yellowduck.jp', 'www.zetton-av.com', 'www.zeus-web.net', 'www2.giga-jp.com', 'yamidas.yamiboard.com', 'yamigama.com']
splits = get_splits(domains)
print(get_counts(splits, source))


Prints

Counter({'com': 20, 'www': 15, 'jp': 7, 'net': 5, 'co': 5, 'yami': 5, 'yellow': 4, 'live': 4, 'sex': 3, 'blogspot': 2, 'av': 2, 'movie': 2, 'porn': 2, 'xxx': 1, 'top': 1, 'vip': 1, 'one': 1, 'jav': 1, 'ola': 1, 'pro': 1, 'xyz': 1, 'web': 1, 'ru': 1, 'vn': 1, 'cc': 1, 'st': 1, 'sy': 1, 'vr': 1, 'v': 1, 'video': 1, 'box': 1, 'das': 1})
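A note on the design: both source and splits are kept sorted longest-first, and each matched fragment is trimmed out of splits in place, so a longer word such as blogspot gets to consume its characters before a shorter one such as blog can claim them.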

Upvotes: 2
