Reputation: 192
I'm wondering if anyone could provide any input concerning the speed/performance of python's native data containers versus a Pandas DataFrame -- namely in performing a substring lookup.
Several months ago I posted a question pertaining to the operation (Performing a substring lookup between one dataframe and another). Essentially, I have a list of names (a column in the DataFrame, more than 2 million entries), and would like to 'flag' those names that contain a substring from a separate list of vulgar words (more than 3,000 entries). The solution presented to me has worked nicely, and I'm assuming it's the most efficient option for a DataFrame.
Since then, however, I have moved on to creating a GUI (with PyQt5) which includes a progress bar. The issue with the progress bar is that I need some form of iteration that lets me determine the % progress completed. At this point, I altered my code to only use native Python iterables (no Pandas DataFrame), and did my operations in a for loop, allowing me to have a determinate progress bar.
I assumed this would be much slower, knowing that the performance advantage of a DataFrame arose from the ability to vectorize operations. However, to my surprise, the iterative method using Python was ~15% faster.
What is the reason for this? Is the pandas method not really vectorized, and still performing some looping behind the scenes? Or are list/sets/generators just more lightweight and faster compared to a DataFrame?
Here is my code of both methods:
Pandas DataFrame implementation
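import pandas as pd
df = pd.read_csv(source_file, names=['ID', 'Fullname'])
# strip trailing newlines and skip blank lines so each word can match inside a name
with open(vulgar_lookup_file, 'r') as f:
    vulgars = [line.strip() for line in f if line.strip()]
df['Vulgar Flag'] = df['Fullname'].str.contains('|'.join(vulgars))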
Native Python iterative method
# strip trailing newlines and skip blank lines so each word can match inside a name
with open(vulgar_lookup_file, 'r') as f:
    vulgars = set(line.strip() for line in f if line.strip())
# accessing second column of comma-delimited file (containing the fullname)
source = (line.split(',')[1] for line in open(source_file, 'r'))
vulgar_flag = []
for item in source:
    result = any(substr in item for substr in vulgars)
    vulgar_flag.append(result)
I know the iterative method can be further simplified into a list comprehension, which yields the same result ~12% faster than the for loop above. I just put it in loop form for the sake of readability here.
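For reference, the list-comprehension form might look like the minimal sketch below. The in-memory `names` and `vulgars` values are made-up placeholders standing in for the contents of the two files, not the real inputs.

```python
# Placeholder data standing in for the two input files (assumed for illustration).
vulgars = {"darn", "heck"}
names = ["Ed Mcdarn", "Jane Smith", "Al Theckston"]

# One pass over the names; any() stops at the first substring that matches.
vulgar_flag = [any(substr in name for substr in vulgars) for name in names]

print(vulgar_flag)
```

The matching here is case-sensitive, exactly as in the loop version.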
Thank you!
Upvotes: 3
Views: 1158
Reputation: 19885
Long story short, no, str methods are not vectorised.
If we look at the pandas code, we can find that the str methods eventually delegate to pandas._libs.lib.map_infer, which is defined as follows:
def map_infer(ndarray arr, object f, bint convert=1):
    """
    Substitute for np.vectorize with pandas-friendly dtype inference

    Parameters
    ----------
    arr : ndarray
    f : function

    Returns
    -------
    mapped : ndarray
    """
    cdef:
        Py_ssize_t i, n
        ndarray[object] result
        object val

    n = len(arr)
    result = np.empty(n, dtype=object)
    for i in range(n):
        val = f(arr[i])

        if cnp.PyArray_IsZeroDim(val):
            # unbox 0-dim arrays, GH#690
            # TODO: is there a faster way to unbox?
            #   item_from_zerodim?
            val = val.item()

        result[i] = val

    if convert:
        return maybe_convert_objects(result,
                                     try_float=0,
                                     convert_datetime=0,
                                     convert_timedelta=0)

    return result
We can see that it is basically a for loop, albeit in Cython for speed.
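To make that concrete, here is a rough pure-Python sketch of the same idea. `map_infer_py` is a hypothetical stand-in written for illustration, not the pandas function, and the sample names and words are made up. It assumes the per-element function is a compiled-regex search, which is roughly what a regex-based `str.contains` has to do for each value:

```python
import re

def map_infer_py(values, f):
    """Hypothetical pure-Python analogue of the loop above: call f once per element."""
    result = []
    for val in values:
        result.append(f(val))
    return result

# Made-up sample data for illustration.
names = ["Ed Mcdarn", "Jane Smith", "Al Theckston"]
vulgars = ["darn", "heck"]

# Escape each word so it is matched literally, then join into one alternation.
pattern = re.compile("|".join(re.escape(word) for word in vulgars))

# One Python-level function call (and one regex search) per name.
flags = map_infer_py(names, lambda s: bool(pattern.search(s)))
print(flags)
```

Either way there is one Python function call per element, so the Cython loop only removes the loop overhead, not the per-element work. That would be consistent with a plain `any(substr in item ...)` loop being competitive.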
Upvotes: 2