tanbog

Reputation: 610

Errors inserting many rows into PostgreSQL with psycopg2

I have a number of XML files that I need to open and process to produce a large number of rows, which are then inserted into several tables in a remote PostgreSQL database.

To extract the XML data I am using xml.etree.ElementTree to parse the tree and pull out elements as needed. While I am doing a number of things, the basic action is to take a specific element, whether a string or an integer, and place it in one of several dictionaries.
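
Roughly, the extraction looks something like this (the tag names and dictionary layout here are only illustrative, not my real schema):

import xml.etree.ElementTree as ET

# Illustrative only: element names and the dictionary structure are placeholders.
tree = ET.parse('example.xml')
root = tree.getroot()

items = {}
for item in root.iter('item'):
    doc_id = int(item.findtext('document_id'))
    index = int(item.findtext('item_index'))
    code = int(item.findtext('item_code'))
    items[(doc_id, index)] = code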

After some more processing I have a number of dictionaries that I need to insert into my database. For any single XML file I may produce 8,000-10,000 rows (or queries) across 3 tables.

While testing, I was writing the queries out to SQL files and running them manually. That is obviously not going to work if I have lots of XML files.

I have therefore tried using psycopg2 to automate this process. As I understand it from Stack Overflow and elsewhere, running individual execute calls is painfully slow. Based on this Stack Overflow question I have attempted to write code as follows:

QueryData = ','.join(cur.mogrify('(%s,%s,%s)', row) for row in myData)
cur.execute('INSERT INTO DBTABLE' + QueryData)
cur.commit()

where myData is a list of tuples [(a,b,c),(a,b,c),(a,b,c)...], the contents of which are a combination of data extracted with xml.etree.ElementTree and values I have calculated myself.

When I try to actually execute the above code however I get the following error:

TypeError: sequence item 0: expected str instance, bytes found

OK... if I instead try to convert my data (each tuple element) to str(), I get:

TypeError: encoding without a string argument

Am I just going about this totally wrong? How can I do what I need? I am using Python 3.

ADDITIONAL

I was asked to show an example of the data.

Here is the simplest case: 3 integer values to put into a table, of the form (document_id, item_index, item_code).

A typical example would be: (937, 138, 681)

My general attempts at conversion have been along the lines of:

(str(document_id),str(item_index),str(item_code))

I have also tried going the other way:

(bytes(document_id,'utf-8'),bytes(item_index,'utf-8'),bytes(item_code,'utf-8'))

The latter also raises the error: TypeError: encoding without a string argument

Upvotes: 3

Views: 2152

Answers (3)

Agargara

Reputation: 932

The psycopg documentation states that for cur.mogrify:

The returned string is always a bytes string.

So to use this hack, you just need to decode the result of mogrify back to a string, for example:

QueryData = ','.join(cur.mogrify('(%s,%s,%s)', row).decode('utf-8') for row in myData)
cur.execute('INSERT INTO DBTABLE' + QueryData)
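
Put together, that might look something like the sketch below (the table and column names are just placeholders, and the connection string is up to you):

import psycopg2

conn = psycopg2.connect('dbname=mydb')  # placeholder connection parameters
cur = conn.cursor()

# mogrify returns bytes on Python 3, so decode each piece before joining
values = ','.join(cur.mogrify('(%s,%s,%s)', row).decode('utf-8') for row in myData)
cur.execute('INSERT INTO dbtable (document_id, item_index, item_code) VALUES ' + values)
conn.commit()  # commit on the connection, not the cursor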

However, as mentioned in this Stack Overflow question, the most efficient way to copy large amounts of data is to use COPY. You can do so with any Python file-like object. Here's an example from the psycopg docs:

>>> f = StringIO("42\tfoo\n74\tbar\n")
>>> cur.copy_from(f, 'test', columns=('num', 'data'))
>>> cur.execute("select * from test where id > 5;")
>>> cur.fetchall()
[(6, 42, 'foo'), (7, 74, 'bar')]
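
For the data in the question, a rough sketch of the same idea (again, table and column names are assumptions) could build a tab-separated buffer in memory and load it in one go:

from io import StringIO

buf = StringIO()
for row in myData:
    # one tab-separated line per tuple, e.g. "937\t138\t681"
    buf.write('\t'.join(str(v) for v in row) + '\n')
buf.seek(0)

cur.copy_from(buf, 'dbtable', columns=('document_id', 'item_index', 'item_code'))
conn.commit()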

Upvotes: 7

tanbog

Reputation: 610

OK, so I have it working... I am, however, confused as to why my solution worked. I am posting it as an answer, but if someone could explain what is going on, that would be great.

Basically, this:

QueryData = ','.join(cur.mogrify('(%s,%s,%s)', row) for row in myData)
cur.execute('INSERT INTO DBTABLE' + QueryData)

had to be changed to:

QueryData = b','.join(cur.mogrify(b'(%s,%s,%s)', row) for row in myData)
cur.execute(b'INSERT INTO DBTABLE' + QueryData)

Which strikes me as pretty inelegant.
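
For what it's worth, a tidier route (assuming psycopg2 2.7 or later, which provides psycopg2.extras.execute_values; table and column names below are placeholders) would avoid the bytes juggling entirely:

from psycopg2.extras import execute_values

# execute_values expands the single %s into a VALUES list built from myData
execute_values(
    cur,
    'INSERT INTO dbtable (document_id, item_index, item_code) VALUES %s',
    myData
)
conn.commit()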

Upvotes: 0

Juan Albarracín

Reputation: 143

You are missing VALUES after the table name; everything else seems correct:

cursorPG.execute("INSERT INTO test VALUES "+','.join(cursorPG.mogrify('(%s,%s)',x) for x in mydata))

Upvotes: 0
