Reputation: 1993
I have the following code and I'm running it on some big data (2 hours processing time). I'm looking into CUDA for GPU acceleration, but in the meantime can anyone suggest ways to optimise the following code?
It takes a 3D point from dataset 'T' and finds the point with the minimum distance in another point dataset 'B'.
Is there any time saved by sending the results to a list first and then inserting them into the database table?
All suggestions welcome
conn = psycopg2.connect("<details>")
cur = conn.cursor()
for i in range(len(B)):
    i2 = i + 1
    # point=T[i]
    point = B[i:i2]
    # print(B[i])
    # print(B[i:i2])
    disti = scipy.spatial.distance.cdist(point, T, metric='euclidean').min()
    print("Base: ", end='')
    print(i, end='')
    print(" of ", end='')
    print(len(B), end='')
    print(" ", end='')
    print(disti)
    cur.execute("""INSERT INTO pc_processing.pc_dist_base_tmp (x,y,z,dist) values (%s, %s, %s, %s)""",
                (xi[i], yi[i], zi[i], disti))
    conn.commit()
cur.close()
Code update:
conn = psycopg2.connect("dbname=kap_pointcloud host=localhost user=postgres password=Gnob2009")
cur = conn.cursor()
disti = []
for i in range(len(T)):
i2 = i + 1
point = T[i:i2]
disti.append(scipy.spatial.distance.cdist(point, B, metric='euclidean').min())
print("Top: " + str(i) + " of " + str(len(T)))
Insert code to go here once I figure out the syntax
@@@@@@@@ EDIT @@@@@@@@
The solution, with a lot of help from Alex:
from scipy.spatial.distance import cdist

cur = conn.cursor()
insert_params = []  # list for accumulating insert-params
for i in range(len(T)):
    XA = [T[i]]  # the current point from T, wrapped so cdist gets a 2-D input
    disti = cdist(XA, B, metric='euclidean').min()
    insert_params.append((xi[i], yi[i], zi[i], disti))
    print("Top: " + str(i) + " of " + str(len(T)))
# Only one instruction to insert everything
cur.executemany("INSERT INTO pc_processing.pc_dist_top_tmp (x,y,z,dist) values (%s, %s, %s, %s)",
                insert_params)
conn.commit()
For timing comparison:
the initial code took: 0:00:50.225644
without the multi-line prints: 0:00:47.934012
taking the commit out of the loop: 0:00:25.411207
I'm assuming the only way to make it faster is to get CUDA working?
Upvotes: 1
Views: 80
Reputation: 3547
OK. Let's accumulate all suggestions from comments.
1) Commit as rarely as possible, and don't print at all:

conn = psycopg2.connect("<details>")
cur = conn.cursor()
for i in range(len(B)):
    i2 = i + 1
    point = B[i:i2]
    disti = scipy.spatial.distance.cdist(point, T, metric='euclidean').min()
    cur.execute("""INSERT INTO pc_processing.pc_dist_base_tmp (x,y,z,dist) values (%s, %s, %s, %s)""",
                (xi[i], yi[i], zi[i], disti))
conn.commit()  # Note that you commit only once. Be careful with really big chunks of data
cur.close()
If you really need debug information inside your loops, use logging. You will be able to turn the logging output on and off when you need it.
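For example, here is a minimal sketch of that idea for the loop above (B, T and the rest of the setup are assumed to exist exactly as in the question):

import logging
from scipy.spatial.distance import cdist

logging.basicConfig(level=logging.DEBUG)  # raise to logging.INFO later to silence the per-row output
log = logging.getLogger(__name__)

for i in range(len(B)):
    point = B[i:i + 1]
    disti = cdist(point, T, metric='euclidean').min()
    log.debug("Base: %d of %d %s", i, len(B), disti)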
2) executemany to the rescue:

conn = psycopg2.connect("<details>")
cur = conn.cursor()
insert_params = []  # list for accumulating insert-params
for i in range(len(B)):
    i2 = i + 1
    point = B[i:i2]
    disti = scipy.spatial.distance.cdist(point, T, metric='euclidean').min()
    insert_params.append((xi[i], yi[i], zi[i], disti))
# Only one instruction to insert everything
cur.executemany("INSERT INTO pc_processing.pc_dist_base_tmp (x,y,z,dist) values (%s, %s, %s, %s)", insert_params)
conn.commit()
cur.close()
3) Get rid of psycopg2 INSERTs altogether and use BULK operations: instead of cur.execute and conn.commit, write a csv file, and then use COPY from the created file. The BULK solution should give the best performance, but it takes some effort to make it work.
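As a rough sketch of the csv + COPY route (the file name is a placeholder, insert_params is the list accumulated in the previous snippet, and for convenience this still goes through psycopg2, using its copy_expert call to issue the single COPY):

import csv

csv_path = "distances.csv"  # hypothetical file name

# 1) Dump the accumulated (x, y, z, dist) rows to a csv file.
with open(csv_path, "w", newline="") as f:
    csv.writer(f).writerows(insert_params)

# 2) Load the whole file with one COPY statement.
with open(csv_path) as f:
    cur.copy_expert(
        "COPY pc_processing.pc_dist_base_tmp (x, y, z, dist) FROM STDIN WITH (FORMAT csv)",
        f)
conn.commit()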
Choose whichever approach is appropriate for you, depending on how much speed you need.
Good luck
Upvotes: 1
Reputation: 908
There are 2 solutions:
1) Try to do a single commit, or commit in chunks if len(B) is very large.
2) You can prepare a list of the data you are inserting and do a bulk insert, e.g.:

insert into pc_processing.pc_dist_base_tmp (x, y, z, dist)
select * from unnest(array[1, 2, 3, 4], array[1, 2, 3, 4], array[1, 2, 3, 4], array[1, 2, 3, 4]);
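For what it's worth, the same unnest-style bulk insert can be driven from psycopg2, which adapts Python lists to Postgres arrays; xs, ys, zs and dists here are hypothetical lists accumulated in the loop instead of executing one INSERT per point:

# One round trip for all rows: each list becomes one Postgres array.
cur.execute(
    """INSERT INTO pc_processing.pc_dist_base_tmp (x, y, z, dist)
       SELECT * FROM unnest(%s::float8[], %s::float8[], %s::float8[], %s::float8[])""",
    (xs, ys, zs, dists))
conn.commit()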
Upvotes: 3
Reputation: 11
Try committing when the loop has finished instead of on every single iteration.
Upvotes: 1