Reputation: 344
I have been able to extract close to 3.5 mil rows from a Postgres table using Python and write them to a file. However, the process is extremely slow and I'm sure it's not the most efficient. Following is my code:
import psycopg2, time, csv

conn_string = "host='compute-1.amazonaws.com' dbname='re' user='data' password='reck' port=5433"
conn = psycopg2.connect(conn_string)
cursor = conn.cursor()
query = '''select data from table;'''
cursor.execute(query)

def get_data():
    while True:
        recs = cursor.fetchmany(10000)
        if not recs:
            break
        for columns in recs:
            # do transformation of data here
            yield columns

solr_input = get_data()

count = 0
with open('prc_ind.csv', 'a') as fh:
    for i in solr_input:
        count += 1
        if count % 1000 == 0:
            print(count)
        a, b, c, d = i['Skills'], i['Id'], i['History'], i['Industry']
        fh.write("{0}|{1}|{2}|{3}\n".format(a, b, c, d))
The table has about 8 mil rows. Is there a better, faster, and less memory-intensive way to accomplish this?
Upvotes: 0
Views: 2959
Reputation: 61
psycopg2's copy_to command does the same thing as the psql dump Loïc suggested, except on the Python side of things. I've found it to be the fastest way to get a table dump.
The formatting for certain data types (such as hstore/json and composite types) is a bit funky, but the command is very simple.
# assumes `cursor` comes from an open psycopg2 connection, as in the question
with open('foobar.dat', 'wb') as f:
    cursor.copy_to(f, 'table', sep='|', columns=['skills', 'id', 'history', 'industry'])
Docs here: http://initd.org/psycopg/docs/cursor.html#cursor.copy_to
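If the default formatting does get in the way (hstore/json, composite types), cursor.copy_expert lets you hand the server a full COPY statement so it does the CSV quoting for you. A minimal sketch, reusing the question's conn_string, table, and column names; the output filename out.csv is just an example:

import psycopg2

conn = psycopg2.connect(conn_string)  # conn_string as defined in the question
with conn.cursor() as cursor, open('out.csv', 'w') as f:
    # server-side COPY with proper CSV quoting and a pipe delimiter
    cursor.copy_expert(
        "COPY (SELECT skills, id, history, industry FROM table) "
        "TO STDOUT WITH (FORMAT csv, DELIMITER '|')",
        f)
conn.close()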
Upvotes: 3
Reputation: 11942
I can see four fields, so I'll assume you are selecting only these.
But even then, you are still loading 8 mil x 4 x n bytes of data from what appears to be another server, so yes, it'll take some time.
Rather than reinventing the wheel, why not use the PostgreSQL client?
psql -d dbname -t -A -F"," -c "select * from users" > output.csv
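If you want the pipe-delimited output from the question rather than commas, psql's \copy meta-command accepts the same options as COPY and handles quoting; a sketch using the question's column names (adjust the database name, host, and credentials as needed):

psql -d dbname -c "\copy (select skills, id, history, industry from table) to 'output.csv' with (format csv, delimiter '|')"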
Upvotes: 3