Reputation: 1975
I have a list of lists where some of the sublists are comprised entirely of `None` and some have `None` at the end and in between strings. I need to do three things:
1. Remove the `None`s from the end of sublists if any, replacing ones in the middle which are separated by strings with empty strings.
2. Exclude sublists which are entirely `None`s.
3. Create a new list of lists with the results.
My attempt gets the desired results, but I am wondering if there is a faster way to do this:
from itertools import islice

rows = [["row 1 index 0", None, "row 1 index 2", None, None],
        [None, "row 2 index 1", None, None, None],
        [None, None, None, None, None]]

data = []
for r in rows:
    for i, c in enumerate(reversed(r)):
        if c is not None:
            data.append(["" if x is None else str(x)
                         for x in islice(r, 0, len(r) - i)])
            break

print(data)
Desired results/output:
[['row 1 index 0', '', 'row 1 index 2'], ['', 'row 2 index 1']]
Benchmark (as best I know how):
from itertools import islice
import time

q = ["string", None, "string", None, "string"] + [None] * 95
rows = [q.copy() for i in range(500000)]

for z in range(1, 6):
    st = time.time()
    data = []
    for r in rows:
        for i, c in enumerate(reversed(r)):
            if c is not None:
                data.append(["" if x is None else str(x)
                             for x in islice(r, 0, len(r) - i)])
                break
    end = time.time()
    print("Run: " + str(z) + "| time: " + str(end - st))
Results (i5 Ivy Bridge, Windows 10):
Run: 1| time: 5.787390232086182
Run: 2| time: 5.802111387252808
Run: 3| time: 5.697156190872192
Run: 4| time: 5.38789963722229
Run: 5| time: 5.739344596862793
Upvotes: 1
Views: 233
Reputation: 96172
The performance depends on the nature of the data! See the following timings, and lean towards whichever approach works better for the data-set you expect to encounter. I propose a partially in-place solution that has the best performance in my tests so far, but hopefully it is obvious that the benchmarking would need to be fleshed out more to get a real picture of the trade-offs.
First I made a test-set.
In [60]: rows = [["row 1 index 0",None,"row 1 index 2",None,None],
    ...:         [None,"row 2 index 1",None,None,None],
    ...:         [None,None,None,None,None]]
In [61]: rowsbig = [r*1000 for r in rows]
In [62]: rowsbig = [list(r) for _ in range(1000) for r in rowsbig]
In [63]: sum(len(r) for r in rowsbig)
Out[63]: 15000000
Now, a little helper to keep things sanitary:
In [65]: def test_set(source=rowsbig):
    ...:     return [list(r) for r in source]
    ...:
So, let's wrap the three proposed approaches in functions:
In [86]: def new_to_coding(rows):
    ...:     data = []
    ...:     for r in rows:
    ...:         for i, c in enumerate(reversed(r)):
    ...:             if c is not None:
    ...:                 data.append(["" if x is None else str(x)
    ...:                              for x in islice(r, 0, len(r) - i)])
    ...:                 break
    ...:     return data
    ...:
In [87]: def Bit(rows):
    ...:     # replace None with '' in every row
    ...:     data = [list(map(lambda x: '' if x is None else x, row)) for row in rows]
    ...:     # drop all-'' rows and cut each survivor after its last non-empty element
    ...:     data = [row[:max(i for i, e in enumerate(row, 1) if e != '')] for row in data if set(row) != {''}]
    ...:     return data
    ...:
In [88]: def taras(rows):
    ...:     # remove lists with all Nones
    ...:     rows1 = [row for row in rows if set(row) != {None}]
    ...:     # remove trailing Nones
    ...:     rows2 = [dropwhile(lambda x: x is None, reversed(row)) for row in rows1]
    ...:     # replace None with ''
    ...:     rows3 = [list(reversed([x if x is not None else '' for x in row])) for row in rows2]
    ...:     return rows3
    ...:
And a quick sanity check:
In [89]: taras(test_set()) == new_to_coding(test_set())
Out[89]: True
In [90]: Bit(test_set()) == new_to_coding(test_set())
Out[90]: True
Now, some timing setup. A note to @new_to_coding: always use the `timeit` module for benchmarks. It takes care of a lot of subtleties that a naive `time.time()` approach ignores, and it is handier to boot!
In [91]: from timeit import timeit
In [92]: setup = "from __main__ import new_to_coding, Bit, taras, test_set; testrows = test_set()"
Now, the results:
In [93]: # using OP's method
...: timeit('new_to_coding(testrows)', setup, number=5)
Out[93]: 5.416837869910523
In [94]: # using `Bit`
...: timeit('Bit(testrows)', setup, number=5)
Out[94]: 14.52187539380975
In [95]: # using `taras`
...: timeit('taras(testrows)', setup, number=5)
Out[95]: 3.7361009169835597
So, it seems that the incremental approach wins! Of course, the exact nature of the data may change these relative timings. I suspect that the proportion of all-`None` rows will affect the relative performance of these approaches. Warning! This turns out to be very true! See the edit below.
I've gone ahead and micro-optimized @taras's approach: I made sure all names are local to the function (so no global look-ups), replaced `list(reversed(alist))` with `alist[::-1]`, and made the intermediate transformations generator expressions, so that only a single list is materialized:
In [111]: def is_None(x): return x is None
     ...:
     ...: def taras_micro_op(rows, dropwhile=dropwhile, reversed=reversed, set=set, is_None=is_None):
     ...:     # remove lists with all Nones
     ...:     rows1 = (row for row in rows if set(row) != {None})
     ...:     # remove trailing Nones
     ...:     rows2 = (dropwhile(is_None, reversed(row)) for row in rows1)
     ...:     # replace None with ''
     ...:     rows3 = [[x if x is not None else '' for x in row][::-1] for row in rows2]
     ...:     return rows3
     ...:
In [112]: taras_micro_op(test_set()) == taras(test_set())
Out[112]: True
In [113]: setup = "from __main__ import taras, taras_micro_op, test_set; testrows = test_set()"
In [114]: # using `taras`
...: timeit('taras(testrows)', setup, number=50)
Out[114]: 35.11660181987099
In [115]: # using `taras_micro_op`
...: timeit('taras_micro_op(testrows)', setup, number=50)
Out[115]: 33.70030225184746
In [116]: 33.70030225184746 / 35.11660181987099
Out[116]: 0.9596686611281929
Less than a 5% improvement. Indeed, I would ditch the "inlining with default arguments" and just use intermediate generator expressions, if only to help with memory efficiency.
In other words, I suggest using the following:
In [117]: def taras_memory_op(rows):
     ...:     # remove lists with all Nones
     ...:     rows1 = (row for row in rows if set(row) != {None})
     ...:     # remove trailing Nones
     ...:     rows2 = (dropwhile(lambda x: x is None, reversed(row)) for row in rows1)
     ...:     # replace None with ''
     ...:     rows3 = [[x if x is not None else '' for x in row][::-1] for row in rows2]
     ...:     return rows3
     ...:
In [118]: setup = "from __main__ import taras, taras_memory_op, test_set; testrows = test_set()"
In [119]: # using `taras`
...: timeit('taras(testrows)', setup, number=50)
Out[119]: 35.10479677491821
In [120]: # using `taras_memory_op`
...: timeit('taras_memory_op(testrows)', setup, number=50)
Out[120]: 34.00812040804885
In [121]: 34.00812040804885/35.10479677491821
Out[121]: 0.9687599283396816
As it turns out, most of the already minor improvement comes from using the generator expressions anyway!
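To check the memory effect directly rather than infer it from timings, here is a minimal sketch using the standard-library tracemalloc module (my illustration, not part of the original benchmarks) that compares peak allocations:

import tracemalloc

def peak_bytes(func, rows):
    # measure the peak traced allocation while func processes rows
    tracemalloc.start()
    func(rows)
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return peak

# e.g. compare peak_bytes(taras, test_set()) with peak_bytes(taras_memory_op, test_set());
# the generator version should peak lower, since rows1 and rows2 are never
# materialized as full lists.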
So, I tried it with the test set provided by the OP:
In [3]: q = ["string",None,"string",None,"string"] + [None] * 95
...: rows = [q.copy() for i in range(500000)]
...:
In [4]: sum(len(r) for r in rows)
Out[4]: 50000000
Note that in my original test-set, about 33% of the rows were all `None`. In the test set above, however, no row is entirely `None`. This definitely turned out to affect performance.
In [7]: def test_set(source=rows):
   ...:     return [list(r) for r in source]
   ...:
In [8]: setup = "from __main__ import new_to_coding, taras_memory_op, test_set; testrows = test_set()"
In [9]: # using OP's method
...: timeit('new_to_coding(testrows)', setup, number=5)
Out[9]: 14.014577565016225
In [10]: # using `taras_memory_op`
...: timeit('taras_memory_op(testrows)', setup, number=5)
Out[10]: 33.28037207596935
So, I propose yet another solution. Warning! The following solution mutates the inner lists in-place:
In [14]: def sanitize(rows):
    ...:     result = []
    ...:     for row in rows:
    ...:         tail = True
    ...:         maxidx = len(row) - 1
    ...:         for i, item in enumerate(reversed(row)):
    ...:             if item is None:
    ...:                 if tail:
    ...:                     row.pop()
    ...:                 else:
    ...:                     row[maxidx - i] = ''
    ...:             else:
    ...:                 tail = False
    ...:         if row:
    ...:             result.append(row)
    ...:     return result
    ...:
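A small usage note of my own: because sanitize mutates the inner lists (overwriting mid-row Nones and popping trailing ones), hand it fresh copies, exactly what test_set does, whenever the original rows still matter:

# sanitize alters rows in-place, so copy first if the originals are still needed
clean = sanitize([list(r) for r in rows])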
In [15]: setup = "from __main__ import new_to_coding, taras_memory_op, sanitize, test_set; testrows = test_set()"
In [16]: # using `sanitize`
...: timeit('sanitize(testrows)', setup, number=5)
Out[16]: 8.261458976892754
In [17]: sanitize(test_set()) == new_to_coding(test_set())
Out[17]: True
So, using the test-set I originally made:
In [18]: rows = [["row 1 index 0",None,"row 1 index 2",None,None],
    ...:         [None,"row 2 index 1",None,None,None],
    ...:         [None,None,None,None,None]]
In [19]: rows = [r*1000 for r in rows]
In [20]: rowsbig = [list(r) for _ in range(1000) for r in rows]
In [21]: rows = rowsbig
In [22]: del rowsbig
In [23]: def test_set(source=rows):
    ...:     return [list(r) for r in source]
    ...:
In [24]: setup = "from __main__ import new_to_coding, taras_memory_op, sanitize, test_set; testrows = test_set()"
In [25]: # using `taras_memory_op`
...: timeit('taras_memory_op(testrows)', setup, number=10)
Out[25]: 6.563127358909696
In [26]: # using OP's method
...: timeit('new_to_coding(testrows)', setup, number=10)
Out[26]: 10.173962660133839
In [27]: # using `sanitize`
...: timeit('sanitize(testrows)', setup, number=10)
Out[27]: 6.3629974271170795
Upvotes: 3
Reputation: 2576
I'm sure there is a more compact way, but here is my take with list comprehensions:
# replace None with '' in every row
data = [list(map(lambda x: '' if x is None else x, row)) for row in rows]
# drop all-'' rows and cut each survivor after its last non-empty element
data = [row[:max(i for i, e in enumerate(row, 1) if e != '')] for row in data if set(row) != {''}]
Upvotes: 1
Reputation: 6915
You can incrementally eliminate the Nones with list comprehensions:
from itertools import dropwhile
rows = [["row 1 index 0",None,"row 1 index 2",None,None],
        [None,"row 2 index 1",None,None,None],
        [None,None,None,None,None]]
# remove lists with all Nones
rows1 = [row for row in rows if set(row) != {None}]
# remove trailing Nones
rows2 = [dropwhile(lambda x: x is None, reversed(row)) for row in rows1]
# replace None with ''
rows3 = [list(reversed([x if x is not None else '' for x in row])) for row in rows2]
print(rows3)
The output:
[['row 1 index 0', '', 'row 1 index 2'], ['', 'row 2 index 1']]
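One caveat worth adding (my note, not the answerer's): dropwhile returns a lazy, single-use iterator, so each entry of rows2 can be consumed exactly once, which the rows3 comprehension does immediately. For example:

from itertools import dropwhile

row = [None, "a", None, "b", None, None]
trimmed = dropwhile(lambda x: x is None, reversed(row))
print(list(trimmed))  # ['b', None, 'a', None] -- trailing Nones gone, order reversed
print(list(trimmed))  # [] -- the iterator is already exhausted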
Upvotes: 3