Reputation: 21
I am trying to parse a large number of configuration files and group the results into separate groups based on content - I just do not know how to approach this. For example, say I have the following data in four files:
config1.txt
ntp 1.1.1.1
ntp 2.2.2.2

config2.txt
ntp 1.1.1.1

config3.txt
ntp 2.2.2.2
ntp 1.1.1.1

config4.txt
ntp 2.2.2.2
The results would be:

Sets of unique data 3:
Set 1 (1.1.1.1, 2.2.2.2): config1.txt, config3.txt
Set 2 (1.1.1.1): config2.txt
Set 3 (2.2.2.2): config4.txt
I understand how to glob the directory of files, loop over the glob results, open each file in turn, and use regex to match each line. The part I do not understand is how I could store these results and compare each file's results to the others, so that two files match when they contain the same entries even if those entries appear in a different order. Any help would be appreciated.
Thanks!
Upvotes: 1
Views: 169
Reputation: 7946
This alternative is more verbose than others, but it may be more efficient depending on a couple of factors (see my notes at the end). Unless you're processing a large number of files with a large number of configuration items, I wouldn't even consider using this over some of the other suggestions, but if performance is an issue this algorithm might help.
Start with two dictionaries: one mapping each configuration string to the set of files that contain it (call it c2f), and one mapping each file to the set of configuration strings it contains (f2c). Both can be built as you glob the files.
To be clear, c2f is a dictionary where the keys are strings and the values are sets of files. f2c is a dictionary where the keys are files, and the values are sets of strings.
Loop over the file keys of f2c and pick one data item from each file's set. Use c2f to find all files that contain that item; those are the only files you need to compare against.
Here's the working code:
# this structure simulates the file system and contents.
cfg_data = {
    "config1.txt": ["1.1.1.1", "2.2.2.2"],
    "config2.txt": ["1.1.1.1"],
    "config3.txt": ["2.2.2.2", "1.1.1.1"],
    "config4.txt": ["2.2.2.2"]
}

# Build the dictionaries (this is O(n) over the lines of configuration data)
f2c = dict()
c2f = dict()
for file, data in cfg_data.items():
    data_set = set()
    for item in data:
        data_set.add(item)
        if item not in c2f:
            c2f[item] = set()
        c2f[item].add(file)
    f2c[file] = data_set

# build the results as a list of pairs of lists:
results = []
# track the processed files
processed = set()
for file, data in f2c.items():
    if file in processed:
        continue
    equivalence_list = []
    # get one item from data, preferably the one used by the smallest set of
    # files.
    item = None
    item_files = 0
    for i in data:
        if item is None:
            item = i
            item_files = len(c2f[item])
        elif len(c2f[i]) < item_files:
            item = i
            item_files = len(c2f[i])
    # All files with the same data as this one must contain at least the
    # chosen item, so only those files need to be compared.
    for other_file in c2f[item]:
        other_data = f2c[other_file]
        if other_data == data:
            equivalence_list.append(other_file)
            # No need to visit these files again
            processed.add(other_file)
    results.append((data, equivalence_list))

# Display the results
for data, files in results:
    print(data, ':', files)
Adding a note on computational complexity: this is technically O((K log N) * (L log M)), where N is the number of files, M is the number of unique configuration items, K (<= N) is the number of groups of files with the same content, and L (<= M) is the average number of files that have to be compared pairwise for each of the K processed groups. This should be efficient if K << N and L << M.
Upvotes: 2
Reputation: 88977
from collections import defaultdict

# Load the data.
paths = ["config1.txt", "config2.txt", "config3.txt", "config4.txt"]
files = {}
for path in paths:
    with open(path) as file:
        for line in file:
            ...  # Get data from files
        files[path] = frozenset(data)

# Example data.
files = {
    "config1.txt": frozenset(["1.1.1.1", "2.2.2.2"]),
    "config2.txt": frozenset(["1.1.1.1"]),
    "config3.txt": frozenset(["2.2.2.2", "1.1.1.1"]),
    "config4.txt": frozenset(["2.2.2.2"]),
}

sets = defaultdict(list)
for key, value in files.items():
    sets[value].append(key)
Note you need to use frozensets: they are immutable and hence hashable, so they can be used as dictionary keys. As the sets are not going to change, this is fine.
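To print the grouping in the question's format, you can iterate over sets afterwards. A minimal sketch, re-using the example data above (the sorted calls are added here only to make the output order stable):

```python
from collections import defaultdict

# Same example data as above.
files = {
    "config1.txt": frozenset(["1.1.1.1", "2.2.2.2"]),
    "config2.txt": frozenset(["1.1.1.1"]),
    "config3.txt": frozenset(["2.2.2.2", "1.1.1.1"]),
    "config4.txt": frozenset(["2.2.2.2"]),
}

sets = defaultdict(list)
for key, value in files.items():
    sets[value].append(key)

# Each key is a frozenset of config entries; each value is the list of
# files sharing exactly those entries.
print("Sets of unique data %d:" % len(sets))
for i, (data, names) in enumerate(sets.items(), 1):
    print("Set %d (%s): %s" % (i, ", ".join(sorted(data)), ", ".join(sorted(names))))
```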
Upvotes: 2
Reputation: 7419
You need a dictionary mapping the contents of the files to the filename. So you have to read each file, sort the entries, build a tuple from them, and use this as a key.

If you can have duplicate entries in a file: read the contents into a set first.
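A minimal sketch of that idea; the simulated file contents stand in for real open()/glob() reads, and the line parsing (taking the second word of each "ntp ..." line) is an assumption based on the question's data:

```python
# Simulated file contents (in practice, read these from disk with open()).
cfg_files = {
    "config1.txt": ["ntp 1.1.1.1", "ntp 2.2.2.2"],
    "config2.txt": ["ntp 1.1.1.1"],
    "config3.txt": ["ntp 2.2.2.2", "ntp 1.1.1.1"],
    "config4.txt": ["ntp 2.2.2.2"],
}

groups = {}
for filename, lines in cfg_files.items():
    # A set removes duplicate entries within a file;
    # sorting makes the tuple key independent of entry order.
    entries = {line.split()[1] for line in lines}
    key = tuple(sorted(entries))
    groups.setdefault(key, []).append(filename)
```

Because the key is a sorted tuple, config1.txt and config3.txt end up under the same key even though their entries appear in different orders.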
Upvotes: 1
Reputation: 134811
filenames = [ r'config1.txt',
              r'config2.txt',
              r'config3.txt',
              r'config4.txt' ]

results = {}
for filename in filenames:
    with open(filename, 'r') as f:
        contents = ( line.split()[1] for line in f )
        key = frozenset(contents)
        results.setdefault(key, []).append(filename)
Upvotes: 2
Reputation: 2923
I'd approach this like this:
First, get a dictionary like this:
{"1.1.1.1": (file1, file2, file3), "2.2.2.2": (file1, file3, file4)}

Then loop over the files, generating the sets:

{file1: ("1.1.1.1", "2.2.2.2"), etc.}

Then compare the values of the sets:

if val(file1) == val(file3):
    Set1 = {("1.1.1.1", "2.2.2.2"): (file1, file3), etc.}

This is probably not the fastest or most elegant solution, but it should work.
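A rough, runnable sketch of those steps (the data is inlined from the question, and the variable names are illustrative):

```python
from collections import defaultdict

# Per-file entry sets, as produced by parsing the files.
cfg_files = {
    "config1.txt": {"1.1.1.1", "2.2.2.2"},
    "config2.txt": {"1.1.1.1"},
    "config3.txt": {"2.2.2.2", "1.1.1.1"},
    "config4.txt": {"2.2.2.2"},
}

# Step 1: map each item to the files containing it.
item_to_files = defaultdict(set)
for name, items in cfg_files.items():
    for item in items:
        item_to_files[item].add(name)

# Steps 2-3: compare the per-file sets; files whose sets are equal
# land under the same frozenset key.
groups = defaultdict(list)
for name, items in cfg_files.items():
    groups[frozenset(items)].append(name)
```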
Upvotes: 1