ragardner

Reputation: 1975

Python 3 remove None values from end of sublist or exclude if sublist is entirely None values

I have a list of lists where some of the sublists are composed entirely of None and some have Nones at the end or in between strings. I need to do three things:

  1. Remove any Nones from the end of each sublist, and replace Nones that fall between strings with empty strings.

  2. Exclude sublists that are entirely Nones.

  3. Create a new list of lists with the results.

My attempt gets the desired results, but I am wondering if there is a faster way to do this:

from itertools import islice

rows = [["row 1 index 0",None,"row 1 index 2",None,None],
        [None,"row 2 index 1",None,None,None],
        [None,None,None,None,None]]
data = []

for r in rows:
    for i,c in enumerate(reversed(r)):
        if c is not None:
            data.append(["" if x is None else
                         str(x) for x in islice(r,0,len(r)-i)])
            break
print(data)

Desired results/output:

[['row 1 index 0', '', 'row 1 index 2'], ['', 'row 2 index 1']]

Benchmark (as best I know how):

from itertools import islice
import time

q = ["string",None,"string",None,"string"] + [None] * 95
rows = [q.copy() for i in range(500000)]

for z in range(1,6):
    st = time.time()
    data = []
    for r in rows:
        for i,c in enumerate(reversed(r)):
            if c is not None:
                data.append(["" if x is None else
                             str(x) for x in islice(r,0,len(r)-i)])
                break
    end = time.time()
    print ("Run: " + str(z) + "| time: " + str(end-st))

results (i5 ivybridge windows 10):

Run: 1| time: 5.787390232086182
Run: 2| time: 5.802111387252808
Run: 3| time: 5.697156190872192
Run: 4| time: 5.38789963722229
Run: 5| time: 5.739344596862793

Upvotes: 1

Views: 233

Answers (3)

juanpa.arrivillaga

Reputation: 96172

TL;DR

The performance depends on the nature of the data! See the following timings, and lean towards the one that works best for the data-set you expect to encounter. I propose a partially in-place solution that has the best performance in my tests so far, but hopefully it is obvious that the benchmarking would need to be fleshed out more to get a real picture of the trade-offs.

First I made a test-set.

In [60]: rows = [["row 1 index 0",None,"row 1 index 2",None,None],
    ...:         [None,"row 2 index 1",None,None,None],
    ...:         [None,None,None,None,None]]

In [61]: rowsbig = [r*1000 for r in rows]

In [62]: rowsbig = [list(r) for _ in range(1000) for r in rowsbig]

In [63]: sum(len(r) for r in rowsbig)
Out[63]: 15000000

Now, a little helper to keep things sanitary:

In [65]: def test_set(source=rowsbig):
     ...:     return [list(r) for r in source]
     ...:

So, let's wrap the three proposed approaches in functions:

In [86]: def new_to_coding(rows):
    ...:     data = []
    ...:     for r in rows:
    ...:         for i,c in enumerate(reversed(r)):
    ...:             if c is not None:
    ...:                 data.append(["" if x is None else
    ...:                          str(x) for x in islice(r,0,len(r)-i)])
    ...:                 break
    ...:     return data
    ...:

In [87]: def Bit(rows):
    ...:     data = [list(map(lambda x: '' if x is None else x, row)) for row in rows]
    ...:     data = [row[:max(i for i, e in enumerate(row, 1) if e != '')] for row in data if set(row) != {''}]
    ...:     return data
    ...:

In [88]: def taras(rows):
    ...:     # remove lists with all Nones
    ...:     rows1 = [row for row in rows if set(row) != {None}]
    ...:     # remove trailing Nones
    ...:     rows2 = [dropwhile(lambda x: x is None, reversed(row)) for row in rows1]
    ...:     # replace None with ''
    ...:     rows3 = [list(reversed([x if x is not None else '' for x in row])) for row in rows2]
    ...:     return rows3
    ...:

And a quick sanity check:

In [89]: taras(test_set()) == new_to_coding(test_set())
Out[89]: True

In [90]: Bit(test_set()) == new_to_coding(test_set())
Out[90]: True

Now, some timing setup. Note to @new_to_coding: always use the timeit module for benchmarks. A naive time.time() approach ignores a lot of subtleties (timer resolution, run-to-run interference), and timeit is handier as well!
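For scripts outside of IPython (where the %timeit magic isn't available), a minimal timeit pattern might look like the sketch below; `work` and the repeat/number counts are placeholders, not part of the benchmark above:

```python
from timeit import repeat

def work():
    # toy workload standing in for the real benchmark body
    return sum(range(10_000))

# repeat() re-runs the timing loop several times; the minimum is the most
# stable estimate, since larger values reflect interference, not the code
times = repeat('work()', globals={'work': work}, number=100, repeat=5)
print(min(times))
```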

In [91]: from timeit import timeit

In [92]: setup = "from __main__ import new_to_coding, Bit, taras, test_set; testrows = test_set()"

Now, the results:

In [93]: # using OP's method
    ...: timeit('new_to_coding(testrows)', setup, number=5)
Out[93]: 5.416837869910523

In [94]: # using `Bit`
    ...: timeit('Bit(testrows)', setup, number=5)
Out[94]: 14.52187539380975

In [95]: # using `taras`
    ...: timeit('taras(testrows)', setup, number=5)
Out[95]: 3.7361009169835597

So, it seems that the incremental approach wins! Of course, the exact nature of the data may change these relative timings. I suspect that the proportion of "all None" rows will affect the relative performance of these approaches. Warning: this turns out to be very true! See the edit below.

I've gone ahead and micro-optimized @taras' approach: making sure all names are local to the function (so no global look-ups), replacing list(reversed(alist)) with alist[::-1], and turning the intermediate transformations into generator expressions, so that only a single list is materialized:

In [111]: def is_None(x): return x is None
     ...:
     ...: def taras_micro_op(rows, dropwhile=dropwhile, reversed=reversed, set=set, is_None=is_None):
     ...:     # remove lists with all Nones
     ...:     rows1 = (row for row in rows if set(row) != {None})
     ...:     # remove trailing Nones
     ...:     rows2 = (dropwhile(is_None, reversed(row)) for row in rows1)
     ...:     # replace None with ''
     ...:     rows3 = [[x if x is not None else '' for x in row][::-1] for row in rows2]
     ...:     return rows3
     ...:

In [112]: taras_micro_op(test_set()) == taras(test_set())
Out[112]: True

In [113]: setup = "from __main__ import taras, taras_micro_op, test_set; testrows = test_set()"

In [114]: # using `taras`
     ...: timeit('taras(testrows)', setup, number=50)
Out[114]: 35.11660181987099

In [115]: # using `taras_micro_op`
     ...: timeit('taras_micro_op(testrows)', setup, number=50)
Out[115]: 33.70030225184746

In [116]: 33.70030225184746 / 35.11660181987099
Out[116]: 0.9596686611281929

Less than a 5% improvement. Indeed, I would ditch the "inlining with default arguments" and just use intermediate generator expressions, if only to help with memory efficiency.

In other words, I suggest using the following:

In [117]: def taras_memory_op(rows):
     ...:     # remove lists with all Nones
     ...:     rows1 = (row for row in rows if set(row) != {None})
     ...:     # remove trailing Nones
     ...:     rows2 = (dropwhile(lambda x: x is None, reversed(row)) for row in rows1)
     ...:     # replace None with ''
     ...:     rows3 = [[x if x is not None else '' for x in row][::-1] for row in rows2]
     ...:     return rows3
     ...:

In [118]: setup = "from __main__ import taras, taras_memory_op, test_set; testrows = test_set()"

In [119]: # using `taras`
     ...: timeit('taras(testrows)', setup, number=50)
Out[119]: 35.10479677491821

In [120]: # using `taras_memory_op`
     ...: timeit('taras_memory_op(testrows)', setup, number=50)
Out[120]: 34.00812040804885

In [121]: 34.00812040804885/35.10479677491821
Out[121]: 0.9687599283396816

Most of the (already minor) improvement actually comes from using the generator expressions anyway!
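To illustrate the memory point: a generator expression holds only its execution frame, not the items it will produce, so its footprint stays constant no matter how long the input is. A quick, illustrative check (the size of N here is arbitrary):

```python
import sys

N = 1_000_000
# the generator holds only its frame; the list holds N references up front
gen = ('' if x is None else x for x in [None] * N)
lst = ['' if x is None else x for x in [None] * N]

print(sys.getsizeof(gen))  # small and constant, regardless of N
print(sys.getsizeof(lst))  # grows with N (several MB here)
```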

Edit

So, I tried it with the test set provided by the OP:

In [3]: q = ["string",None,"string",None,"string"] + [None] * 95
   ...: rows = [q.copy() for i in range(500000)]
   ...:

In [4]: sum(len(r) for r in rows)
Out[4]: 50000000

Note that in my original test-set, about 33% of the rows were "all None". In the set above, however, no rows are all None. This definitely turned out to affect performance.

In [7]: def test_set(source=rows):
   ...:     return [list(r) for r in source]
   ...:

In [8]: setup = "from __main__ import new_to_coding, taras_memory_op, test_set; testrows = test_set()"

In [9]: # using OP's method
   ...: timeit('new_to_coding(testrows)', setup, number=5)
Out[9]: 14.014577565016225

In [10]: # using `taras_memory_op`
    ...: timeit('taras_memory_op(testrows)', setup, number=5)
Out[10]: 33.28037207596935

So, I propose yet another solution. Warning: the following solution mutates the inner lists in place:

In [14]: def sanitize(rows):
    ...:     result = []
    ...:     for row in rows:
    ...:         tail = True
    ...:         maxidx = len(row) - 1
    ...:         for i, item in enumerate(reversed(row)):
    ...:             if item is None:
    ...:                 if tail:
    ...:                     row.pop()
    ...:                 else:
    ...:                     row[maxidx - i] = ''
    ...:             else:
    ...:                 tail = False
    ...:         if row:
    ...:             result.append(row)
    ...:     return result
    ...:
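If mutating the caller's lists is undesirable, copying each row first recovers non-destructive behavior at some memory cost. A sketch (`sanitize_copy` is a hypothetical wrapper; `sanitize` is repeated only so the snippet is self-contained):

```python
def sanitize(rows):
    # same in-place algorithm as above
    result = []
    for row in rows:
        tail = True
        maxidx = len(row) - 1
        for i, item in enumerate(reversed(row)):
            if item is None:
                if tail:
                    row.pop()      # still in the trailing run: drop it
                else:
                    row[maxidx - i] = ''  # interior None: blank it
            else:
                tail = False
        if row:
            result.append(row)
    return result

def sanitize_copy(rows):
    # hypothetical wrapper: copy each inner list so the caller's data survives
    return sanitize([list(row) for row in rows])

rows = [["a", None, "b", None], [None, None]]
print(sanitize_copy(rows))  # [['a', '', 'b']]
print(rows)                 # original rows unchanged
```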

In [15]: setup = "from __main__ import new_to_coding, taras_memory_op, sanitize, test_set; testrows = test_set()"

In [16]: # using `sanitize`
    ...: timeit('sanitize(testrows)', setup, number=5)
Out[16]: 8.261458976892754

In [17]: sanitize(test_set()) == new_to_coding(test_set())
Out[17]: True

So, using the test-set I originally made:

In [18]: rows = [["row 1 index 0",None,"row 1 index 2",None,None],
    ...:         [None,"row 2 index 1",None,None,None],
    ...:         [None,None,None,None,None]]

In [19]: rows = [r*1000 for r in rows]

In [20]: rowsbig = [list(r) for _ in range(1000) for r in rows]

In [21]: rows = rowsbig

In [22]: del rowsbig

In [23]: def test_set(source=rows):
    ...:     return [list(r) for r in source]
    ...:

In [24]: setup = "from __main__ import new_to_coding, taras_memory_op, sanitize, test_set; testrows = test_set()"

In [25]: # using `taras`
    ...: timeit('taras_memory_op(testrows)', setup, number=10)
Out[25]: 6.563127358909696

In [26]: # using OP's method
    ...: timeit('new_to_coding(testrows)', setup, number=10)
Out[26]: 10.173962660133839

In [27]: # using `sanitize`
    ...: timeit('sanitize(testrows)', setup, number=10)
Out[27]: 6.3629974271170795

Upvotes: 3

EsotericVoid

Reputation: 2576

I'm sure there is a more compact way, but this is my take with list comprehensions:

data = [list(map(lambda x: '' if x is None else x, row)) for row in rows]
data = [row[:max(i for i, e in enumerate(row, 1) if e != '')] for row in data if set(row) != {''}]

Upvotes: 1

taras

Reputation: 6915

You can incrementally eliminate the Nones with list comprehensions:

from itertools import dropwhile

rows = [["row 1 index 0",None,"row 1 index 2",None,None],
        [None,"row 2 index 1",None,None,None],
        [None,None,None,None,None]]

# remove lists with all Nones
rows1 = [row for row in rows if set(row) != {None}]     
# remove trailing Nones
rows2 = [dropwhile(lambda x: x is None, reversed(row)) for row in rows1]
# replace None with ''
rows3 = [list(reversed([x if x is not None else '' for x in row])) for row in rows2]
print(rows3)

The output:

[['row 1 index 0', '', 'row 1 index 2'], ['', 'row 2 index 1']]
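Note that dropwhile only strips the leading run of items matching the predicate, which is why each row is reversed first; a small standalone example:

```python
from itertools import dropwhile

row = [None, None, 'a', None, 'b', None]

# dropwhile stops at the first item that fails the predicate,
# so only the *leading* run of Nones is removed
print(list(dropwhile(lambda x: x is None, row)))
# ['a', None, 'b', None]

# to strip trailing Nones instead: reverse, drop, reverse back
print(list(dropwhile(lambda x: x is None, reversed(row)))[::-1])
# [None, None, 'a', None, 'b']
```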

Upvotes: 3
