HuFlungPu

Reputation: 561

Python 3.6: Speed up regex replace in Pandas dataframe

I'm using the following code to read 1 million rows of SQL data and replace any control characters that might appear in the data. The only problem is that it is slow, and it is definitely the 'replace' that is slowing it down. Does anyone have any suggestions for a different approach, or a tweak that might make the code faster?

# build a {control character: ' '} replacement map
d = {}
x = map(chr, list(range(0, 9)) + list(range(11, 13)) + list(range(14, 32)) +
             list(range(127, 160)))
for item in list(x):
    d.update({item: ' '})

with open("out_cleaned.csv", "w", encoding='utf-8') as fh:
    chunks = pd.read_sql_query(SQLCommand, connection, chunksize=10000)  
    c = next(chunks)
    c.replace(d, regex=True, inplace=True)
    c.to_csv(fh, index=False, header=False, sep='\t', chunksize=10000)  

    for chunk in chunks:
        chunk.replace(d, regex=True, inplace=True)
        chunk.to_csv(fh, index=False, header=False, sep='\t', chunksize=10000) 

It takes 16 minutes to read, clean and write out 1 million rows (of 31 fields).

Upvotes: 0

Views: 540

Answers (1)

zwer

Reputation: 25799

You don't need regex for this to begin with - you're just replacing 'special' characters with a space in a one-to-one replacement - and, apart from that, you hardly need to parse your data and turn it into a DataFrame at all.

You can work directly with a DB connection and export the columns using the built-in csv module without ever venturing into pandas, SQLAlchemy and similar heavyweights that add unnecessary overhead for your use case.

So, first things first, instead of regex you can create a translation table and use it with str.translate() to clean up any string:

chr_ranges = (0x00, 0x09), (0x0B, 0x20), (0x7F, 0xA0)  # 'special' character ranges
trans_table = {x: " " for r in chr_ranges for x in range(*r)} # 'special'->space trans. table

This allows you to quickly and effortlessly translate all of the special characters defined by the ranges in chr_ranges into a space in any string, for example:

print("Your string with >\x05\x06\x1A< special characters!".translate(trans_table))
# Your string with >   < special characters!

And while we're at it, we can create a small function to handle the translation attempts for any passed field so we don't need to check the type when iterating over our database data:

def trans_field(value):
    try:
        return value.translate(trans_table)  # try to translate and return
    except AttributeError:  # if there's no translate method on the passed value...
        return value  # return the original value

Now all we need to do is connect to the database and execute our query. How you do that depends on the database you're using - I'll write the next example as if you were using SQLite, but most database drivers implement the Python Database API and are largely interchangeable, so the code should work with minimal changes:

import sqlite3

connection = sqlite3.connect("your_db")  # connect to the database
cursor = connection.cursor()  # grab a database cursor
results = cursor.execute("select * from your_table")  # execute the select query
header = [c[0] for c in cursor.description]  # get the column names for our CSV header

And, finally, we can iterate over the results, process each field and store all into a CSV:

import csv

with open("output.csv", "w", newline="") as f:  # open("output.csv", "wb") on Python 2.x
    writer = csv.writer(f, delimiter="\t")  # create a CSV writer with \t as a delimiter
    writer.writerow(header)  # write the header (column names)
    for result in results:  # iterate over the returned results
        writer.writerow(map(trans_field, result))  # process result fields and write the row

This avoids all unnecessary conversions and should work as fast as Python and your database are capable of. You could, technically, squeeze out some more speed by inspecting cursor.description and creating a replacement map only for the string columns in your result set (instead of attempting to process each field), but it probably won't add much to the overall speed.
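
If you do want to experiment with that idea, here is a rough sketch of how it might look. Note that sqlite3 doesn't populate the type information in cursor.description, so this version sniffs the first returned row to decide which columns hold strings instead; the trans_row helper and the first-row detection are assumptions of this sketch, not part of the code above:

import csv
import sqlite3

chr_ranges = (0x00, 0x09), (0x0B, 0x20), (0x7F, 0xA0)  # 'special' character ranges
trans_table = {x: " " for r in chr_ranges for x in range(*r)}  # 'special'->space trans. table

connection = sqlite3.connect("your_db")  # connect to the database
cursor = connection.cursor()  # grab a database cursor
results = cursor.execute("select * from your_table")  # execute the select query
header = [c[0] for c in cursor.description]  # get the column names for our CSV header

with open("output.csv", "w", newline="") as f:
    writer = csv.writer(f, delimiter="\t")
    writer.writerow(header)
    first = next(results, None)  # peek at one row to see which columns hold strings
    if first is not None:
        # assumes the first row is representative (i.e. no NULLs in the text columns)
        text_cols = {i for i, value in enumerate(first) if isinstance(value, str)}

        def trans_row(row):  # translate only the string columns, pass the rest through untouched
            return [value.translate(trans_table) if i in text_cols else value
                    for i, value in enumerate(row)]

        writer.writerow(trans_row(first))
        for row in results:
            writer.writerow(trans_row(row))

The gain here is that numeric fields skip the try/except path entirely - whether that's actually measurable depends on your data mix.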

So, putting it all together:

import csv
import sqlite3

chr_ranges = (0x00, 0x09), (0x0B, 0x20), (0x7F, 0xA0)  # 'special' character ranges
trans_table = {x: " " for r in chr_ranges for x in range(*r)} # 'special'->space trans. table

def trans_field(value):
    try:
        return value.translate(trans_table)  # try to translate and return
    except AttributeError:  # if there's no translate method on the passed value...
        return value  # return the original value

connection = sqlite3.connect("your_db")  # connect to the database
cursor = connection.cursor()  # grab a database cursor
results = cursor.execute("select * from your_table")  # execute the select query
header = [c[0] for c in cursor.description]  # get the column names for our CSV header

with open("output.csv", "w", newline="") as f:  # open("output.csv", "wb") on Python 2.x
    writer = csv.writer(f, delimiter="\t")  # create a CSV writer with \t as a delimiter
    writer.writerow(header)  # write the header (column names)
    for result in results:  # iterate over the returned results
        writer.writerow(map(trans_field, result))  # process result fields and write the row

As a test, I created a 31x1M table in SQLite with 22 TEXT fields (each filled with 10-50 random characters in the 0x00-0xA0 range) interspersed with INTEGER and REAL fields. On my system it cleaned up the data and produced output.csv in under 56 seconds. YMMV, of course, but it certainly shouldn't take 16 minutes.
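
For reference, if you want to reproduce a comparable test table, a rough sketch of how one could be generated is below - the table name, column names and the exact layout and value distribution are assumptions made for illustration, not the exact setup used for the timing above:

import random
import sqlite3

random.seed(0)

# 22 TEXT columns interspersed with INTEGER and REAL ones (exact layout is an assumption)
col_types = ["TEXT"] * 22 + ["INTEGER"] * 5 + ["REAL"] * 4
random.shuffle(col_types)

def random_value(col_type):
    if col_type == "INTEGER":
        return random.randint(0, 10 ** 6)
    if col_type == "REAL":
        return random.random() * 10 ** 6
    # 10-50 random characters, mostly within the 'special' ranges (NUL skipped for simplicity)
    return "".join(chr(random.randint(0x01, 0x9F)) for _ in range(random.randint(10, 50)))

connection = sqlite3.connect("your_db")
cursor = connection.cursor()
cursor.execute("create table your_table ({})".format(
    ", ".join("c{} {}".format(i, t) for i, t in enumerate(col_types))))
placeholders = ", ".join("?" for _ in col_types)
rows = ([random_value(t) for t in col_types] for _ in range(1000000))
cursor.executemany("insert into your_table values ({})".format(placeholders), rows)
connection.commit()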

Upvotes: 1
