Adama
Adama

Reputation: 792

Convert MySQL "INSERT" commands to text

I am trying to import a recent Wikipedia dump into a MySQL database. The problem is, I am inserting a 50 GiB text table using INSERT INTO text MySQL commands and I want to convert these into a text file.

My text.sql file has the following structure:

INSERT INTO text (old_id,old_text,old_flags) VALUES (id1,'text1','flags1'),(id2,'text2','flags2'),...,(idN,'textN','flagsN');

However, using mysql -u USERNAME -p DBNAME < text.sql is very slow. I am already disabling autocommit,unique_checks and foreign_key_checks, and I am enclosing all transactions within a START TRANSACTION; ... COMMIT; block, but the import process is still very slow.

After researching, I read here that using LOAD DATA INFILE; can be much faster than using INSERT commands. Therefore, I am looking to convert text.sql to text.txt as follows:

id1,'text1','flags1'
id2,'text2','flags2'
...
idN,'textN','flagsN'

I was thinking of using awk for this, but my experience with regular expressions is very limited. Furthermore, each INSERT command is given in a single line, as shown above, making it for me even more difficult to extract the values.

Given that the text.sql file is 50 GiB, would you recommend using awk or to develop a C/C++ program? If awk is a good approach, how could I achieve the conversion?

Input #1 example:

INSERT INTO text (old_id,old_text,old_flags) VALUES (id1,'text1','flags1'),(id2,'text2','flags2'),(id3,'text3','flags3');

Output #1 example:

id1,'text1','flags1'
id2,'text2','flags2'
id3,'text3','flags3'

Input #2 example: (with parenthesis in the values)

INSERT INTO page (page_id,page_namespace,page_title,page_restrictions,page_is_redirect,page_is_new,page_random,page_touched,page_latest,page_len,page_content_model) VALUES (10,0,'AccessibleComputing','',1,0,RAND(),DATE_ADD('1970-01-01', INTERVAL UNIX_TIMESTAMP() SECOND)+0,631144794,69,'wikitext'),(12,0,'Anarchism','',0,0,RAND(),DATE_ADD('1970-01-01', INTERVAL UNIX_TIMESTAMP() SECOND)+0,703037144,180446,'wikitext');

Output #2 example:

10,0,'AccessibleComputing','',1,0,RAND(),DATE_ADD('1970-01-01', INTERVAL UNIX_TIMESTAMP() SECOND)+0,631144794,69,'wikitext'
12,0,'Anarchism','',0,0,RAND(),DATE_ADD('1970-01-01', INTERVAL UNIX_TIMESTAMP() SECOND)+0,703037144,180446,'wikitext'

Input #3 example: (with escaped ' or ")

INSERT INTO text (old_id,old_text,old_flags) VALUES (631144794,'#REDIRECT [[Computer accessibility]]\n\n{{Redr|move|from CamelCase|up}}','utf-8'),(703037144,'{{Redirect2|Anarchist|Anarchists

|the fictional character|Anarchist (comics)|other uses|Anarchists (disambiguation)}}\n{{pp-move-indef}}\n{{Use British English|date=January 2014}}','utf-8');

Output #3 example:

631144794,'#REDIRECT [[Computer accessibility]]\n\n{{Redr|move|from CamelCase|up}}','utf-8'
703037144,'{{Redirect2|Anarchist|Anarchists|the fictional character|Anarchist (comics)|other uses|Anarchists (disambiguation)}}\n{{pp-move-indef}}\n{{Use British English|date=January 2014}}','utf-8'

edit: after conducting some more research, it appears that examples #2 and #3 may not be converted using regular expressions: sources: #1, #2.

Upvotes: 0

Views: 353

Answers (2)

Ed Morton
Ed Morton

Reputation: 203502

If this isn't what you want:

$ awk -v FPAT='[(][^)]+[)]' '{for (i=2;i<=NF;i++) print substr($i,2,length($i)-2)}' file
id1,'text1','flags1'
id2,'text2','flags2'
idN,'textN','flagsN'

then edit your question to provide clearer, testable sample input and expected output.

The above used GNU awk for FPAT, with other awks you'd use a while(match()) loop.

Upvotes: 1

White Feather
White Feather

Reputation: 2783

Use this:

sed -e 's/(//' -e 's/),//' test.csv

(appropriately piped) and all your lines will be clean.

Change first and last lines manually.

Regards

Upvotes: 0

Related Questions