Manoj

Reputation: 2442

Python: separating a list by unique values

I have the following list of lists.

xlist = [['instructor','plb','error0992'],['instruction','address','00x0993'],['data','address','017x112']]

I am trying to implement a string algorithm in which one step separates the above list into several lists. The criterion is to first find the position in the inner lists that has the fewest unique token values, then split the list by the token value at that position. (Here a token is an element of an inner list.) For example, in the above xlist the fewest unique tokens are in the 2nd position (index 1) => ('plb','address','address'). So I need to break this list into the following two lists.

list1 = [['instruction','address','00x0993'],['data','address','017x112']]
list2 = [['instructor','plb','error0992']]

I am new to Python, and this is my first project. Can anybody suggest a good method, perhaps a suitable list comprehension? Or give a brief explanation of the steps I should follow?

Upvotes: 0

Views: 225

Answers (2)

Paddy3118

Reputation: 4772

A pure-Python, in-memory solution (for when you have the RAM).

To get sets, I transpose xlist and then form a set from each transposed column, which removes any duplicates.

mintokenset is simply the set with the fewest items.

minindex is the column of the inner lists that mintokenset corresponds to.

mintokens is the sorted list of those tokens, and lists is initialised with one empty inner list per token.

The for loop then uses that information to append each inner list to the output list for its token.

>>> from pprint import pprint as pp
>>> 
>>> xlist =[['instructor','plb','error0992'],['instruction','address','00x0993'],['data','address','017x112']]
>>> sets = [set(transposedcolumn) for transposedcolumn in zip(*xlist)]
>>> pp(sets)
[{'instructor', 'data', 'instruction'},
 {'plb', 'address'},
 {'00x0993', '017x112', 'error0992'}]
>>> mintokenset = min(sets, key=lambda x:len(x))
>>> mintokenset
{'plb', 'address'}
>>> minindex = sets.index(mintokenset)
>>> minindex
1
>>> mintokens = sorted(mintokenset)
>>> mintokens
['address', 'plb']
>>> lists = [[] for _ in mintokenset]
>>> lists
[[], []]
>>> for innerlist in xlist:
    lists[mintokens.index(innerlist[minindex])].append(innerlist)


>>> pp(lists)
[[['instruction', 'address', '00x0993'], ['data', 'address', '017x112']],
 [['instructor', 'plb', 'error0992']]]
>>> 
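The steps in the session above can also be wrapped in a single function. This is only a sketch: the function name split_by_min_column is my own, and it uses a collections.defaultdict keyed by token instead of the sorted-token lookup, which avoids the repeated list.index() calls. On ties, the first column with the fewest unique tokens wins.

```python
from collections import defaultdict

def split_by_min_column(rows):
    """Group rows by the column that holds the fewest distinct tokens."""
    # Transpose so that each tuple holds one column of tokens.
    columns = list(zip(*rows))
    # Index of the column with the smallest set of distinct tokens.
    min_index = min(range(len(columns)), key=lambda i: len(set(columns[i])))
    # Collect every row under the token it has in that column.
    groups = defaultdict(list)
    for row in rows:
        groups[row[min_index]].append(row)
    return min_index, dict(groups)

xlist = [['instructor', 'plb', 'error0992'],
         ['instruction', 'address', '00x0993'],
         ['data', 'address', '017x112']]
min_index, groups = split_by_min_column(xlist)
```

Here groups['address'] holds the two 'address' rows and groups['plb'] holds the remaining one, matching list1 and list2 in the question.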

Following on from the above doodle: for big data, assume it is stored in a file (one inner list per line, comma separated). The file can be read once, and mintokenset and minindex found using a complicated generator expression that should reduce the RAM requirement.

The output is similarly stored in as many output files as necessary, with another generator expression reading the input file a second time and routing each input record to its appropriate output file.

Data should stream through with little overall RAM usage.

from pprint import pprint as pp

def splitlists(logname):
    with open(logname) as logf:
        #sets = [set(transposedcolumn) for transposedcolumn in zip(*(line.strip().split(',') for line in logf))]
        mintokenset, minindex = \
            min(((set(transposedcolumn), i)
                 for i, transposedcolumn in
                 enumerate(zip(*(line.strip().split(',') for line in logf)))),
                key=lambda x:len(x[0]))
    mintokens = sorted(mintokenset)
    lists = [open(r'C:\Users\Me\Code\splitlists%03i.dat' % i, 'w') for i in range(len(mintokenset))]
    with open(logname) as logf:
        for innerlist in (line.strip().split(',') for line in logf):
            lists[mintokens.index(innerlist[minindex])].write(','.join(innerlist) + '\n')
    for filehandle in lists:
        filehandle.close()

if __name__ == '__main__':
    # File splitlists.log has the following input
    '''\
instructor,plb,error0992
instruction,address,00x0993
data,address,017x112'''

    logname = 'splitlists.log'
    splitlists(logname)

    # Creates the following two output files:
    #   splitlists000.dat
    '''\
instruction,address,00x0993
data,address,017x112'''
    #   splitlists001.dat
    '''\
instructor,plb,error0992'''
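One caveat about the first pass: zip(*...) unpacks every row into arguments before it starts zipping, so all rows are held at once regardless of the generator expression. A possible alternative, sketched below under my own assumptions, keeps one set of distinct tokens per column and updates it line by line. The name find_min_column is hypothetical, and the sample input is fed from an io.StringIO rather than a real file.

```python
import csv
import io

def find_min_column(lines):
    """Return (index, tokens) for the column with the fewest distinct tokens."""
    column_sets = []
    for record in csv.reader(lines):
        if not record:
            continue  # skip blank lines
        # Add a new set whenever a record is wider than any seen so far.
        while len(column_sets) < len(record):
            column_sets.append(set())
        # Only the distinct tokens per column are kept, never the rows.
        for tokens, token in zip(column_sets, record):
            tokens.add(token)
    min_index = min(range(len(column_sets)), key=lambda i: len(column_sets[i]))
    return min_index, column_sets[min_index]

# Same three records as splitlists.log, read from memory for illustration.
sample = io.StringIO(
    "instructor,plb,error0992\n"
    "instruction,address,00x0993\n"
    "data,address,017x112\n"
)
min_index, tokens = find_min_column(sample)
```

Memory then depends on how many distinct tokens each column has, which can still be large for a column of all-unique values. The second pass that writes the output files can stay as it is.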

Upvotes: 2

CT Zhu

Reputation: 54380

Since you mentioned it's going to be a big dataset (how big?), I think pandas may be the best approach here.

In [1]:
import numpy as np
import pandas as pd

In [4]:
xlist =[['instructor','plb','error0992'],['instruction','address','00x0993'],['data','address','017x112']]
df=pd.DataFrame(xlist, columns=['c1','c2','c3'])

In [6]:
set(df['c2'])

Out[6]:   
{'address', 'plb'}

In [11]:   
print(df[df['c2']=='address'])

            c1       c2       c3
1  instruction  address  00x0993
2         data  address  017x112

In [12]:   
print(df[df['c2']=='plb'])

           c1   c2         c3
0  instructor  plb  error0992
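The session above picks 'c2' by hand. As a rough sketch, the column can also be chosen automatically with DataFrame.nunique() and the split done with groupby. The names min_col and groups are my own, and the column labels are the same made-up ones as above.

```python
import pandas as pd

xlist = [['instructor', 'plb', 'error0992'],
         ['instruction', 'address', '00x0993'],
         ['data', 'address', '017x112']]
df = pd.DataFrame(xlist, columns=['c1', 'c2', 'c3'])

# Label of the column with the fewest distinct values (first one on ties).
min_col = df.nunique().idxmin()

# One list of rows per distinct value found in that column.
groups = {value: sub.values.tolist() for value, sub in df.groupby(min_col)}
```

Each sub in the loop is itself a DataFrame, so it can be kept as one instead of being converted back to nested lists.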

Upvotes: 1
