Le Chase

Reputation: 192

Python performance of native data container vs Pandas DataFrame

I'm wondering if anyone could provide any input concerning the speed/performance of python's native data containers versus a Pandas DataFrame -- namely in performing a substring lookup.

Several months ago I posted a question pertaining to this operation (Performing a substring lookup between one dataframe and another). Essentially, I have a list of names (a column in the DataFrame, length > 2 million), and would like to 'flag' those names that contain a substring from a separate list of vulgar words (length > 3000). The solution presented to me has worked nicely, and I'm assuming it's the most efficient option for a DataFrame.

Since then, however, I have moved on to creating a GUI (with PyQt5) which includes a progress bar. The issue with the progress bar is that I would need some form of iteration that would allow me to determine the % progress completed. At this point, I altered my code to only use native Python iterables (no Pandas DataFrame), and did my operations in a for loop, allowing me to have a determinate progress bar.
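(For context, one way to keep the vectorised pandas call while still reporting determinate progress is to process the column in fixed-size chunks; this is a minimal sketch with illustrative stand-in data, not code from the original post:)

```python
import pandas as pd

# illustrative stand-ins for the real source file and vulgar-word list
df = pd.DataFrame({"Fullname": ["alice", "bob the jerk", "carol", "dan fool"]})
vulgars = ["jerk", "fool"]
pattern = "|".join(vulgars)

chunk_size = 2  # in practice, something like a few thousand rows
flags = []
n = len(df)
for start in range(0, n, chunk_size):
    # run the vectorised substring check on one slice at a time
    chunk = df["Fullname"].iloc[start:start + chunk_size]
    flags.append(chunk.str.contains(pattern))
    # value that could be fed to a QProgressBar after each chunk
    percent_done = min(start + chunk_size, n) * 100 // n

df["Vulgar Flag"] = pd.concat(flags)
print(df["Vulgar Flag"].tolist())  # [False, True, False, True]
```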

I assumed this would be much slower, knowing that the performance advantage of a DataFrame arose from the ability to vectorize operations. However, to my surprise, the iterative method using native Python was ~15% faster.

What is the reason for this? Is the pandas method not really vectorized, and still performing some looping behind the scenes? Or are lists/sets/generators just more lightweight and faster compared to a DataFrame?


Here is my code of both methods:

Pandas DataFrame implementation

import pandas as pd

df = pd.read_csv(source_file, names = ['ID', 'Fullname'])

# strip trailing newlines so they don't end up inside the regex pattern
vulgars = [line.rstrip('\n') for line in open(vulgar_lookup_file, 'r')]

df['Vulgar Flag'] = df['Fullname'].str.contains('|'.join(vulgars))

Native Python iterative method

# strip trailing newlines so the substrings match cleanly
vulgars = set(line.rstrip('\n') for line in open(vulgar_lookup_file, 'r'))

# accessing second column of comma-delimited file (containing the fullname)
source = (line.split(',')[1] for line in open(source_file, 'r'))

vulgar_flag = []
for item in source:
    result = any(substr in item for substr in vulgars)
    vulgar_flag.append(result)

I know the iterative method can be further simplified into a list comprehension, and it yields the same result ~12% faster than the above for loop. I just put it in loop form for the sake of readability here.
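(For reference, the list-comprehension form mentioned above, shown here with small inline sample data standing in for the two files:)

```python
# sample data standing in for the lookup and source files
vulgars = {"badword", "rude"}
source = ["alice smith", "bob badwordson", "carol jones"]

# flag any name containing a vulgar substring (case-sensitive here)
vulgar_flag = [any(substr in item for substr in vulgars) for item in source]
print(vulgar_flag)  # [False, True, False]
```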

Thank you!

Upvotes: 3

Views: 1158

Answers (1)

gmds

Reputation: 19885

Long story short, no, str methods are not vectorised.

If we look at the pandas code, we can find that the str methods eventually delegate to pandas._libs.lib.map_infer, which is defined as follows:

def map_infer(ndarray arr, object f, bint convert=1):
    """
    Substitute for np.vectorize with pandas-friendly dtype inference
    Parameters
    ----------
    arr : ndarray
    f : function
    Returns
    -------
    mapped : ndarray
    """
    cdef:
        Py_ssize_t i, n
        ndarray[object] result
        object val

    n = len(arr)
    result = np.empty(n, dtype=object)
    for i in range(n):
        val = f(arr[i])

        if cnp.PyArray_IsZeroDim(val):
            # unbox 0-dim arrays, GH#690
            # TODO: is there a faster way to unbox?
            #   item_from_zerodim?
            val = val.item()

        result[i] = val

    if convert:
        return maybe_convert_objects(result,
                                     try_float=0,
                                     convert_datetime=0,
                                     convert_timedelta=0)

    return result

We can see that it is basically a for loop, albeit in Cython for speed.
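(A quick way to convince yourself of this from user code, not part of the original answer: an explicit per-element regex loop produces exactly the same flags as `str.contains`, since both apply the match element by element:)

```python
import re
import pandas as pd

names = pd.Series(["alice", "bob the jerk", "carol"])
pattern = "jerk|fool"

# pandas str method (regex matching by default)
flags_pandas = names.str.contains(pattern)

# explicit Python-level loop doing the same per-element regex search
flags_loop = [re.search(pattern, name) is not None for name in names]

print(list(flags_pandas))  # [False, True, False]
print(flags_loop)          # [False, True, False]
```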

Upvotes: 2
