Reputation: 2442
I have the following list of lists.
xlist = [['instructor','plb','error0992'],['instruction','address','00x0993'],['data','address','017x112']]
I am trying to implement a string algorithm in which one step needs to separate the above list into several lists. The separation criterion is to first find the position with the fewest unique token values, then split the list by the token value at that position. (Here a token is an element of an inner list.) For example, in the above xlist, the fewest unique tokens are at index 1, the second element of each inner list => ('plb','address','address'). So I need to break this list into the following two lists.
list1 = [['instruction','address','00x0993'],['data','address','017x112']]
list2 = [['instructor','plb','error0992']]
I am new to Python and this is my first project. Can anybody suggest a good method, perhaps a suitable list comprehension? Or give a brief explanation of the steps I should follow?
Upvotes: 0
Views: 225
Reputation: 4772
Pure Python, in-memory solution (for when you have the RAM).
To get sets, I transpose xlist and then form a set from each transposed column, which removes any duplicates.
mintokenset is simply the set with the fewest items.
minindex is the column of the inner lists that mintokenset corresponds to.
mintokens is the sorted list of those tokens, used to map each token to a position in lists.
lists is initialised with one empty inner list per token in mintokenset.
The for loop uses that information to append each inner list to the appropriate output list.
>>> from pprint import pprint as pp
>>>
>>> xlist =[['instructor','plb','error0992'],['instruction','address','00x0993'],['data','address','017x112']]
>>> sets = [set(transposedcolumn) for transposedcolumn in zip(*xlist)]
>>> pp(sets)
[{'instructor', 'data', 'instruction'},
 {'plb', 'address'},
 {'00x0993', '017x112', 'error0992'}]
>>> mintokenset = min(sets, key=len)
>>> mintokenset
{'plb', 'address'}
>>> minindex = sets.index(mintokenset)
>>> minindex
1
>>> mintokens = sorted(mintokenset)
>>> mintokens
['address', 'plb']
>>> lists = [[] for _ in mintokenset]
>>> lists
[[], []]
>>> for innerlist in xlist:
...     lists[mintokens.index(innerlist[minindex])].append(innerlist)
...
>>> pp(lists)
[[['instruction', 'address', '00x0993'], ['data', 'address', '017x112']],
 [['instructor', 'plb', 'error0992']]]
>>>
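The steps in the session above can also be wrapped in a single function. This is only a sketch, not part of the original answer: the name split_by_min_column is hypothetical, and it swaps the sorted-token index lookup for a dict keyed by token, so each group can be looked up by its token value.

```python
from collections import defaultdict


def split_by_min_column(rows):
    """Split rows on the column that has the fewest distinct tokens.

    Hypothetical helper for illustration; returns (column index, groups).
    """
    # zip(*rows) transposes the rows, giving one tuple per column.
    column_sets = [set(column) for column in zip(*rows)]
    # Index of the column with the fewest distinct tokens (first wins on ties).
    min_index = min(range(len(column_sets)), key=lambda i: len(column_sets[i]))
    # Group every row under the token it holds in that column.
    groups = defaultdict(list)
    for row in rows:
        groups[row[min_index]].append(row)
    return min_index, dict(groups)


xlist = [['instructor', 'plb', 'error0992'],
         ['instruction', 'address', '00x0993'],
         ['data', 'address', '017x112']]
min_index, groups = split_by_min_column(xlist)
```

Using the column index rather than sets.index(mintokenset) also avoids picking the wrong column if two columns happen to contain exactly the same set of tokens.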
Following on from the above doodle: for big data, assume it is stored in a file (one inner list per line, comma separated). The file can be read once, and mintokenset and minindex found using a somewhat complicated generator expression that should reduce the RAM requirement.
The output is similarly stored in as many output files as necessary, using another generator expression to read the input file a second time and route each input record to its appropriate output file.
Data should stream through with little overall RAM usage.
from pprint import pprint as pp

def splitlists(logname):
    # First pass: find the column with the fewest distinct tokens.
    with open(logname) as logf:
        #sets = [set(transposedcolumn) for transposedcolumn in zip(*(line.strip().split(',') for line in logf))]
        mintokenset, minindex = \
            min(((set(transposedcolumn), i)
                 for i, transposedcolumn in
                 enumerate(zip(*(line.strip().split(',') for line in logf)))),
                key=lambda x:len(x[0]))
    mintokens = sorted(mintokenset)
    # One output file per distinct token, numbered in sorted token order.
    lists = [open(r'C:\Users\Me\Code\splitlists%03i.dat' % i, 'w') for i in range(len(mintokenset))]
    # Second pass: write each record to the file that matches its token.
    with open(logname) as logf:
        for innerlist in (line.strip().split(',') for line in logf):
            lists[mintokens.index(innerlist[minindex])].write(','.join(innerlist) + '\n')
    for filehandle in lists:
        filehandle.close()

if __name__ == '__main__':
    # File splitlists.log has the following input
    '''\
instructor,plb,error0992
instruction,address,00x0993
data,address,017x112'''
    logname = 'splitlists.log'
    splitlists(logname)
    # Creates the following two output files:
    # splitlists000.dat
    '''\
instruction,address,00x0993
data,address,017x112'''
    # splitlists001.dat
    '''\
instructor,plb,error0992'''
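One caveat on the first pass: zip(*...) has to unpack every line into an argument before it can transpose, so that step may still hold the whole file in memory. A possible variant, sketched here under the same file-format assumption (comma separated, one record per line), keeps only one set of distinct tokens per column while reading line by line. The name find_min_column and the temporary demo file are hypothetical, for illustration only.

```python
import os
import tempfile


def find_min_column(path):
    """Return (index, tokens) for the column with the fewest distinct tokens.

    Hypothetical helper: reads the file one line at a time, so memory use is
    bounded by the number of distinct tokens rather than the number of lines.
    """
    column_sets = []
    with open(path) as logf:
        for line in logf:
            line = line.strip()
            if not line:
                continue  # ignore blank lines
            fields = line.split(',')
            # Add a set for each column the first time it is seen.
            while len(column_sets) < len(fields):
                column_sets.append(set())
            for tokens, field in zip(column_sets, fields):
                tokens.add(field)
    min_index = min(range(len(column_sets)), key=lambda i: len(column_sets[i]))
    return min_index, column_sets[min_index]


# Demo on a throwaway copy of the sample input.
with tempfile.TemporaryDirectory() as tmpdir:
    demo_path = os.path.join(tmpdir, 'splitlists.log')
    with open(demo_path, 'w') as f:
        f.write('instructor,plb,error0992\n'
                'instruction,address,00x0993\n'
                'data,address,017x112\n')
    min_index, min_tokens = find_min_column(demo_path)
```

The returned index and token set could then feed the second pass of splitlists unchanged.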
Upvotes: 2
Reputation: 54380
Since you mentioned it's going to be a big dataset (how big?), I think pandas may be the best approach here.
In [1]:
import numpy as np
import pandas as pd
In [4]:
xlist =[['instructor','plb','error0992'],['instruction','address','00x0993'],['data','address','017x112']]
df=pd.DataFrame(xlist, columns=['c1','c2','c3'])
In [6]:
set(df['c2'])
Out[6]:
{'address', 'plb'}
In [11]:
print(df[df['c2']=='address'])
            c1       c2       c3
1  instruction  address  00x0993
2         data  address  017x112
In [12]:
print(df[df['c2']=='plb'])
           c1   c2         c3
0  instructor  plb  error0992
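The session above picks column c2 by hand. As a rough sketch (not part of the original answer), pandas could also choose the column and do the split itself, using DataFrame.nunique to count distinct values per column and groupby to build one sub-frame per token. The names min_col and parts are hypothetical.

```python
import pandas as pd

xlist = [['instructor', 'plb', 'error0992'],
         ['instruction', 'address', '00x0993'],
         ['data', 'address', '017x112']]
df = pd.DataFrame(xlist, columns=['c1', 'c2', 'c3'])

# Label of the column with the fewest distinct values (first wins on ties).
min_col = df.nunique().idxmin()

# One sub-frame per distinct token in that column, keyed by the token.
parts = {token: part for token, part in df.groupby(min_col)}
```

Each entry of parts is an ordinary DataFrame, so parts['address'] corresponds to df[df['c2']=='address'] above.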
Upvotes: 1