Reputation: 610
I have a number of XML files that I need to open and process to produce a large number of rows, which are then inserted into several tables in a remote Postgres database.
To extract the XML data I am using xml.etree.ElementTree
to parse the tree and pull out elements as needed. While I am doing a number of things, the basic action is to take a specific element, whether string or integer, and place it in one of a number of dictionaries.
After some more processing I have a number of dictionaries that I need to insert into my database. For any single XML file I may produce up to 8,000-10,000 rows (or queries) across 3 tables.
While testing, I was writing the queries out to SQL files and running them manually. That is obviously not going to work once I have lots of XML files.
I have therefore tried using psycopg2
to automate this process. As I understand it from Stack Overflow and elsewhere, running individual execute
calls is painfully slow. Based on this Stack Overflow question I have attempted to write code as follows:
QueryData = ','.join(cur.mogrify('(%s,%s,%s)', row) for row in myData)
cur.execute('INSERT INTO DBTABLE' + QueryData)
cur.commit()
where myData
is a list of tuples, [(a,b,c),(a,b,c),(a,b,c)...], the contents of which are a combination of data extracted by xml.etree.ElementTree
and values I have calculated myself.
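For context, the rows are built roughly along these lines (the tag names and file name here are simplified placeholders, not my real data):
import xml.etree.ElementTree as ET

tree = ET.parse('example.xml')                           # one of the input XML files
root = tree.getroot()

myData = []
document_id = int(root.findtext('document_id'))          # placeholder tag name
for item_index, item in enumerate(root.iter('item')):    # placeholder tag name
    item_code = int(item.findtext('code'))               # value extracted from the XML
    myData.append((document_id, item_index, item_code))  # item_index is calculated, not parsed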
When I actually try to execute the mogrify/execute code above, however, I get the following error:
TypeError: sequence item 0: expected str instance, bytes found
OK... if I then try to convert each tuple element to str(),
however, I get:
TypeError: encoding without a string argument
Am I just going about this totally wrong? How can I do what I need? I am using Python 3.
ADDITIONAL
I was asked to show an example of the data.
Here is the simplest case: three integer values to put into a table, of the form (document_id, item_index, item_code)
A typical example would be: (937, 138, 681)
My general attempts to convert have been to try:
(str(document_id),str(item_index),str(item_code))
I have also tried going the other way:
(bytes(document_id,'utf-8'),bytes(item_index,'utf-8'),bytes(item_code,'utf-8'))
The latter also raises the error: TypeError: encoding without a string argument
Upvotes: 3
Views: 2152
Reputation: 932
The psycopg documentation states that for cur.mogrify:
The returned string is always a bytes string.
So to use this hack, you just need to decode the result of mogrify back to a string, for example:
QueryData = ','.join(cur.mogrify('(%s,%s,%s)', row).decode('utf-8') for row in myData)
cur.execute('INSERT INTO DBTABLE VALUES ' + QueryData)
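Putting it all together, a fuller sketch with placeholder connection details and the column names from the question would be:
import psycopg2

conn = psycopg2.connect(dbname='mydb', user='me')        # placeholder connection details
cur = conn.cursor()

values = ','.join(cur.mogrify('(%s,%s,%s)', row).decode('utf-8') for row in myData)
cur.execute('INSERT INTO DBTABLE (document_id, item_index, item_code) VALUES ' + values)
conn.commit()                                            # commit on the connection, not the cursor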
However, as mentioned in this Stack Overflow question, the most efficient way to copy large amounts of data is to use COPY. You can do so with any "Python file-like object". Here's an example from the psycopg docs:
>>> from io import StringIO
>>> f = StringIO("42\tfoo\n74\tbar\n")
>>> cur.copy_from(f, 'test', columns=('num', 'data'))
>>> cur.execute("select * from test where id > 5;")
>>> cur.fetchall()
[(6, 42, 'foo'), (7, 74, 'bar')]
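To apply this to the list of tuples from the question, you could build the buffer in memory first (table and column names are placeholders; copy_from uses tab as the default separator):
from io import StringIO

buf = StringIO()
for document_id, item_index, item_code in myData:
    buf.write('{}\t{}\t{}\n'.format(document_id, item_index, item_code))
buf.seek(0)                                              # rewind so copy_from reads from the start

cur.copy_from(buf, 'dbtable', columns=('document_id', 'item_index', 'item_code'))
conn.commit()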
Upvotes: 7
Reputation: 610
OK, so I have it working... I am, however, confused as to why my solution worked. I am posting it as an answer, but if someone could explain to me what is going on, that would be great.
Basically, this:
QueryData = ','.join(cur.mogrify('(%s,%s,%s)', row) for row in myData)
cur.execute('INSERT INTO DBTABLE' + QueryData)
had to be changed to:
QueryData = b','.join(cur.mogrify(b'(%s,%s,%s)', row) for row in myData)
cur.execute(b'INSERT INTO DBTABLE' + QueryData)
Which strikes me as pretty inelegant.
Upvotes: 0
Reputation: 143
You are missing VALUES after the table name; everything else seems correct (just remember to decode the bytes that mogrify returns):
cursorPG.execute("INSERT INTO test VALUES "+','.join(cursorPG.mogrify('(%s,%s)',x).decode('utf-8') for x in mydata))
Upvotes: 0