Reputation: 561
I'm using the following code to read 1 million rows of SQL data and replace any control characters which might appear in the data. The only problem is that it is slow, and it is definitely the 'replace' which is slowing it down. Does anyone have any suggestions for a different approach, or tweaks that might make the code faster?
d = {}
x = map(chr, list(range(0,9)) + list(range(11,13)) + list(range(14,32)) +
        list(range(127,160)))
for item in list(x):
    d.update({item: ' '})
with open("out_cleaned.csv", "w", encoding='utf-8') as fh:
    chunks = pd.read_sql_query(SQLCommand, connection, chunksize=10000)
    c = next(chunks)
    c.replace(d, regex=True, inplace=True)
    c.to_csv(fh, index=False, header=False, sep='\t', chunksize=10000)
    for chunk in chunks:
        chunk.replace(d, regex=True, inplace=True)
        chunk.to_csv(fh, index=False, header=False, sep='\t', chunksize=10000)
It takes 16 minutes to read, clean and write out 1 million rows (of 31 fields).
Upvotes: 0
Views: 540
Reputation: 25799
You don't need regex for this to begin with - you're just replacing 'special' characters with a space in a one-to-one replacement - and, apart from that, you hardly need to parse your data and turn it into a DataFrame at all.
You can work directly with a DB connection and export the columns using the built-in csv module, without ever venturing into pandas, SQLAlchemy and similar heavyweights that add unnecessary overhead for your use case.
So, first things first, instead of regex you can create a translation table and use it with str.translate() to clean up any string:
chr_ranges = (0x00, 0x09), (0x0B, 0x20), (0x7F, 0xA0) # 'special' character ranges
trans_table = {x: " " for r in chr_ranges for x in range(*r)} # 'special'->space trans. table
This allows you to quickly and effortlessly translate all of the special characters, defined within the ranges of chr_ranges, into a space on any string, for example:
print("Your string with >\x05\x06\x1A< special characters!".translate(trans_table))
# Your string with > < special characters!
And while we're at it, we can create a small function to handle the translation attempts for any passed field so we don't need to check the type when iterating over our database data:
def trans_field(value):
    try:
        return value.translate(trans_table)  # try to translate and return
    except AttributeError:  # if there's no translate method on the passed value...
        return value  # return the original value
Now all we need is to connect to the database and execute our query. How you do that depends on the database you're using - I'll write the next example as if you were using SQLite, but most database drivers use the Python Database API and are largely interchangeable, so the code should work with minimal changes:
import sqlite3
connection = sqlite3.connect("your_db") # connect to the database
cursor = connection.cursor() # grab a database cursor
results = cursor.execute("select * from your_table") # execute the select query
header = [c[0] for c in cursor.description] # get the column names for our CSV header
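For example, if your data actually lives in something like SQL Server reached over ODBC, a sketch with pyodbc (the driver name and connection string below are placeholders, not taken from the question) would only change the connection line - the cursor, execute() and description parts stay the same:
import pyodbc  # assumption: the source database is reached via ODBC

# placeholder connection string - adjust driver/server/database to your environment
connection = pyodbc.connect("DRIVER={ODBC Driver 17 for SQL Server};"
                            "SERVER=your_server;DATABASE=your_db;Trusted_Connection=yes;")
cursor = connection.cursor()                          # same DB-API cursor interface
results = cursor.execute("select * from your_table")  # pyodbc's execute() also returns the cursor
header = [c[0] for c in cursor.description]           # column names, exactly as with sqlite3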
And, finally, we can iterate over the results, process each field and write everything out to a CSV:
import csv
with open("output.csv", "w", newline="") as f: # open("output.csv", "wb") on Python 2.x
writer = csv.writer(f, delimiter="\t") # create a CSV writer with \t as a delimiter
writer.writerow(header) # write the header (column names)
for result in results: # iterate over the returned results
writer.writerow(map(trans_field, result)) # process result fields and write the row
This avoids all unnecessary conversions and should work as fast as Python and your database are capable of. You could, technically, squeeze out some more speed by inspecting cursor.description and creating a replacement map only for the string fields in your result set (instead of attempting to process each field), but it probably won't add much to the overall speed.
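If you do want to try that, here's a minimal sketch of the idea. Note that with sqlite3 the type information in cursor.description is empty, so this version detects the string columns by peeking at the first fetched row instead (an assumption on my part - a NULL in that row would hide a text column, and with other drivers you could read the type codes from cursor.description directly):
import csv
import sqlite3

chr_ranges = (0x00, 0x09), (0x0B, 0x20), (0x7F, 0xA0)
trans_table = {x: " " for r in chr_ranges for x in range(*r)}

connection = sqlite3.connect("your_db")
cursor = connection.cursor()
results = cursor.execute("select * from your_table")
header = [c[0] for c in cursor.description]

first_row = results.fetchone()  # peek at one row to find the string columns
text_cols = [i for i, v in enumerate(first_row or ()) if isinstance(v, str)]

def trans_row(row):  # translate only the columns known to hold strings
    row = list(row)
    for i in text_cols:
        if row[i] is not None:
            row[i] = row[i].translate(trans_table)
    return row

with open("output.csv", "w", newline="") as f:
    writer = csv.writer(f, delimiter="\t")
    writer.writerow(header)
    if first_row is not None:
        writer.writerow(trans_row(first_row))  # don't forget the row we peeked at
    for result in results:
        writer.writerow(trans_row(result))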
So, putting it all together:
import csv
import sqlite3
chr_ranges = (0x00, 0x09), (0x0B, 0x20), (0x7F, 0xA0) # 'special' character ranges
trans_table = {x: " " for r in chr_ranges for x in range(*r)} # 'special'->space trans. table
def trans_field(value):
    try:
        return value.translate(trans_table)  # try to translate and return
    except AttributeError:  # if there's no translate method on the passed value...
        return value  # return the original value
connection = sqlite3.connect("your_db") # connect to the database
cursor = connection.cursor() # grab a database cursor
results = cursor.execute("select * from your_table") # execute the select query
header = [c[0] for c in cursor.description] # get the column names for our CSV header
with open("output.csv", "w", newline="") as f: # open("output.csv", "wb") on Python 2.x
writer = csv.writer(f, delimiter="\t") # create a CSV writer with \t as a delimiter
writer.writerow(header) # write the header (column names)
for result in results: # iterate over the returned results
writer.writerow(map(trans_field, result)) # process result fields and write the row
As a test I've created a 31x1M table in SQLite with 22 TEXT fields (each filled with 10-50 random characters in the 0x00 - 0xA0 range), interspersed with INTEGER and REAL fields, and, on my system, it cleaned up the data and produced the output.csv in under 56 seconds. YMMV, of course, but it certainly shouldn't take 16 minutes.
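For reference, a rough sketch of how such a test table could be generated (my own reconstruction, not the exact script behind those numbers - the column names are made up and the TEXT/INTEGER/REAL columns are grouped rather than interspersed):
import random
import sqlite3

random.seed(0)
text_cols = [f"t{i}" for i in range(22)]  # 22 TEXT columns (hypothetical names)
other_cols = [f"i{i} INTEGER" for i in range(5)] + [f"r{i} REAL" for i in range(4)]
columns = [f"{c} TEXT" for c in text_cols] + other_cols  # 31 columns in total

connection = sqlite3.connect("your_db")
connection.execute(f"create table your_table ({', '.join(columns)})")

def random_text():  # 10-50 random characters in the 0x00 - 0xA0 range
    return "".join(chr(random.randrange(0xA1)) for _ in range(random.randint(10, 50)))

def random_row():
    return ([random_text() for _ in text_cols] +
            [random.randint(0, 10**6) for _ in range(5)] +
            [random.random() for _ in range(4)])

connection.executemany(
    f"insert into your_table values ({', '.join('?' * 31)})",
    (random_row() for _ in range(1_000_000)),  # 1M rows; generating them takes a while
)
connection.commit()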
Upvotes: 1