Reputation: 1063
So, I have a Python script that writes a file of upwards of 80 GB. Currently it just runs serially, and it took about 13 hours to run on a server the only time I've actually run it.
I'm going to parallelize it so that it writes many files rather than just one.
It would be slightly easier to keep what I already have in Python and just add multiple threads (there is a single map of shared data that they'll all need to read but nobody writes to, so it doesn't need locking).
However, is it silly to keep it in Python? I know C++ as well, so do you think I should just rewrite it in C++? I figure the program is more disk-bound than anything else (there isn't much logic involved in writing the file), so perhaps it doesn't make much of a difference. I'm not sure how long C++ would take to write the same 80 GB file (serially).
UPDATE 6/6/14, 16:40 PST: I am posting my code below to determine if there is a bottleneck in the code itself as opposed to it being purely disk bound.
I call writeEntriesToSql() once per table, and there are about 30 tables. "size" is the number of inserts to write for the table; the cumulative size across all tables is about 200,000,000.
I did notice that I'm compiling my regular expression over and over again which could be causing me a lot of waste, although I'm not sure how much.
# imports used by these snippets (queryDatabaseMultipleRows, queryDatabaseSingleRowCol,
# foreignKeys and perfTestDir are defined elsewhere in the script)
import math
import random
import re
import string
import sys
from datetime import datetime, timedelta

def writeEntriesToSql(db, table, size, outputFile):
    # get a description of the table
    rows = queryDatabaseMultipleRows(db, 'DESC ' + table)

    fieldNameCol = 0  # no enums in python 2.7 :(
    typeCol = 1
    nullCol = 2
    keyCol = 3
    defaultCol = 4
    extraCol = 5

    fieldNamesToTypes = {}
    for row in rows:
        if row[extraCol].find("auto_increment") == -1:
            # insert this one
            fieldNamesToTypes[row[fieldNameCol]] = row[typeCol]

    for i in range(size):
        fieldNames = ""
        fieldVals = ""
        count = 0

        # go through the fields
        for fieldName, type in fieldNamesToTypes.iteritems():
            # build a string of field names to be used in the INSERT statement
            fieldNames += table + "." + fieldName

            if fieldName in foreignKeys[table]:
                otherTable = foreignKeys[table][fieldName][0]
                otherTableKey = foreignKeys[table][fieldName][1]

                if len(foreignKeys[table][fieldName]) == 3:
                    # we already got the value so we don't have to get it again
                    val = foreignKeys[table][fieldName][2]
                else:
                    # get the value from the other table and store it
                    #### I plan for this to be an infrequent query - unless something is broken here!
                    query = "SELECT " + otherTableKey + " FROM " + otherTable + " LIMIT 1"
                    val = queryDatabaseSingleRowCol(db, query)
                    foreignKeys[table][fieldName].append(val)
                fieldVals += val
            else:
                fieldVals += getDefaultFieldVal(type)

            count = count + 1
            if count != len(fieldNamesToTypes):
                fieldNames += ","
                fieldVals += ","
# return the default field value for a given field type, which will be used to prepopulate our tables
def getDefaultFieldVal(type):
    if not ('insertTime' in globals()):
        global insertTime
        insertTime = datetime.utcnow()
        # store this time in a file so that it can be retrieved by SkyReporterTest.perfoutput.py
        try:
            timeFileName = perfTestDir + "/dbTime.txt"
            timeFile = open(timeFileName, 'w')
            timeFile.write(str(insertTime))
            timeFile.close()
        except:
            print "!!! cannot open file " + timeFileName + " for writing. Please make sure this is run where you have write permissions\n"
            sys.exit(1)

    # many of the types are formatted with a type name, followed by a size in parentheses
    ##### Looking at this more closely, I suppose I could be compiling this once instead of over and over - a big bottleneck here?
    p = re.compile(r"(.*)\(([0-9]+).*")

    size = 0
    if p.match(type):
        size = int(p.sub(r"\2", type))
        type = p.sub(r"\1", type)
    else:
        size = 0

    if type == "tinyint":
        return str(random.randint(1, math.pow(2, 7)))
    elif type == "smallint":
        return str(random.randint(1, math.pow(2, 15)))
    elif type == "mediumint":
        return str(random.randint(1, math.pow(2, 23)))
    elif type == "int" or type == "integer":
        return str(random.randint(1, math.pow(2, 31)))
    elif type == "bigint":
        return str(random.randint(1, math.pow(2, 63)))
    elif (type == "float" or type == "double" or type == "doubleprecision" or type == "decimal"
          or type == "realdecimal" or type == "numeric"):
        return str(random.random() * 100000000)  # arbitrary endpoints for this random value
    elif type == "date":
        insertTime = insertTime - timedelta(seconds=1)
        return "'" + insertTime.strftime("%Y-%m-%d") + "'"
    elif type == "datetime":
        insertTime = insertTime - timedelta(seconds=1)
        return "'" + insertTime.strftime("%Y-%m-%d %H:%M:%S") + "'"
    elif type == "timestamp":
        insertTime = insertTime - timedelta(seconds=1)
        return "'" + insertTime.strftime("%Y%m%d%H%M%S") + "'"
    elif type == "time":
        insertTime = insertTime - timedelta(seconds=1)
        return "'" + insertTime.strftime("%H:%M:%S") + "'"
    elif type == "year":
        insertTime = insertTime - timedelta(seconds=1)
        return "'" + insertTime.strftime("%Y") + "'"
    elif (type == "char" or type == "varchar" or type == "tinyblob" or type == "tinytext" or type == "blob"
          or type == "text" or type == "mediumblob" or type == "mediumtext" or type == "longblob" or type == "longtext"):
        if size == 0:  # not specified
            return "'a'"
        else:
            lst = [random.choice(string.ascii_letters + string.digits) for n in xrange(size)]
            strn = "".join(lst)
            return strn
    elif type == "enum":
        return "NULL"  # TBD if needed
    elif type == "set":
        return "NULL"  # TBD if needed
    else:
        print "!!! Unrecognized mysql type: " + type + "\n"
        sys.exit(1)
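On the regex note in the update: hoisting the compiled pattern out of getDefaultFieldVal(), and doing a single match() instead of match() plus two sub() calls, removes per-call work on a path that runs once per field per row. Here is a minimal sketch of that change; the names TYPE_SIZE_RE and parseTypeAndSize are mine, not from the original script:

# compile the type-parsing pattern once at module level instead of on every call
TYPE_SIZE_RE = re.compile(r"(.*)\(([0-9]+).*")

def parseTypeAndSize(type):
    # hypothetical helper: returns (base type, declared size),
    # with size 0 when the column type has no "(n)" suffix
    m = TYPE_SIZE_RE.match(type)
    if m:
        return m.group(1), int(m.group(2))
    return type, 0

getDefaultFieldVal() could then start with "type, size = parseTypeAndSize(type)" and drop the per-call re.compile() and sub() calls entirely.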
Upvotes: 2
Views: 1188
Reputation: 23995
Also keep in mind that, depending on your configuration, you might see little or no speedup because your code is already I/O-bound. If you have a single disk, writing to it from many threads can do more harm than good.
Upvotes: 1
Reputation: 1429
Python's I/O is not much slower than other languages'. The interpreter can be slow to start up, but writing such a large file will amortize that effect.
I'd recommend looking into the multiprocessing module, which gives you true parallelism by running multiple instances of Python, getting around the GIL. These processes carry some overhead, but again, with an 80 GB file it shouldn't matter too much. Keep in mind that each worker is a complete Python process, so it will use considerably more memory and other resources than a thread.
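A minimal sketch of that approach, assuming one output file per table: writeEntriesToSql() is the function from the question, while tableSizes (a list of (table, size) pairs) and openDatabase() are placeholders for however the script currently gets its table list and its database connection. Each worker needs its own connection, since one generally can't be shared across processes.

from multiprocessing import Pool

def writeTableFile(args):
    # worker: write all INSERTs for one table into that table's own file
    table, size, outputPath = args
    db = openDatabase()                      # placeholder - open a fresh connection per process
    with open(outputPath, 'w') as outputFile:
        writeEntriesToSql(db, table, size, outputFile)
    return outputPath

if __name__ == '__main__':
    jobs = [(table, size, table + '.sql') for table, size in tableSizes]
    pool = Pool(processes=4)                 # tune to core count and what the disk can sustain
    for path in pool.imap_unordered(writeTableFile, jobs):
        print "wrote " + path
    pool.close()
    pool.join()

With one file per table and around 30 tables, the pool size is the main knob: more workers than the disk can absorb will just make the writes contend with each other.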
Upvotes: 3