Reputation: 792
I am trying to import a recent Wikipedia dump into a MySQL database. The problem is that the dump loads a 50 GiB text table
through INSERT INTO text
statements, and I want to convert these statements into a plain text file.
My text.sql
file has the following structure:
INSERT INTO text (old_id,old_text,old_flags) VALUES (id1,'text1','flags1'),(id2,'text2','flags2'),...,(idN,'textN','flagsN');
However, using mysql -u USERNAME -p DBNAME < text.sql
is very slow. I have already disabled autocommit
, unique_checks
and foreign_key_checks
, and I wrap the whole import in a START TRANSACTION; ... COMMIT;
block, but the import is still very slow.
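For reference, a minimal sketch of how those settings can be applied for the whole import in one mysql session (USERNAME and DBNAME are the placeholders from the question; the mysql invocation itself is shown as a comment, not run here):

```shell
# Write the speed-related settings into a prologue file; feeding it and the
# dump to mysql in one invocation keeps the settings in effect throughout.
cat > prologue.sql <<'EOF'
SET autocommit=0;
SET unique_checks=0;
SET foreign_key_checks=0;
EOF
# Real run (not executed here): cat prologue.sql text.sql | mysql -u USERNAME -p DBNAME
cat prologue.sql
```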
After some research, I read here that LOAD DATA INFILE
can be much faster than INSERT
commands. Therefore, I am looking to convert text.sql
to text.txt
as follows:
id1,'text1','flags1'
id2,'text2','flags2'
...
idN,'textN','flagsN'
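Once text.txt has that shape, a LOAD DATA statement along these lines could consume it (a sketch only: the file path is a placeholder, and the ENCLOSED BY/ESCAPED BY clauses would need adjusting to how quotes and backslashes actually appear in the converted file):

```shell
# Write the statement to a file so it can be inspected; the real run would be
#   mysql -u USERNAME -p DBNAME < load_text.sql
cat > load_text.sql <<'EOF'
LOAD DATA INFILE '/path/to/text.txt'
INTO TABLE text
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\'' ESCAPED BY '\\'
LINES TERMINATED BY '\n'
(old_id, old_text, old_flags);
EOF
cat load_text.sql
```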
I was thinking of using awk
for this, but my experience with regular expressions is very limited. Furthermore, each INSERT
command sits on a single line, as shown above, which makes it even harder for me to extract the values.
Given that the text.sql
file is 50 GiB, would you recommend awk
, or writing a C/C++ program? If awk
is a good approach, how could I achieve the conversion?
Input #1 example:
INSERT INTO text (old_id,old_text,old_flags) VALUES (id1,'text1','flags1'),(id2,'text2','flags2'),(id3,'text3','flags3');
Output #1 example:
id1,'text1','flags1'
id2,'text2','flags2'
id3,'text3','flags3'
Input #2 example: (with parentheses in the values)
INSERT INTO page (page_id,page_namespace,page_title,page_restrictions,page_is_redirect,page_is_new,page_random,page_touched,page_latest,page_len,page_content_model) VALUES (10,0,'AccessibleComputing','',1,0,RAND(),DATE_ADD('1970-01-01', INTERVAL UNIX_TIMESTAMP() SECOND)+0,631144794,69,'wikitext'),(12,0,'Anarchism','',0,0,RAND(),DATE_ADD('1970-01-01', INTERVAL UNIX_TIMESTAMP() SECOND)+0,703037144,180446,'wikitext');
Output #2 example:
10,0,'AccessibleComputing','',1,0,RAND(),DATE_ADD('1970-01-01', INTERVAL UNIX_TIMESTAMP() SECOND)+0,631144794,69,'wikitext'
12,0,'Anarchism','',0,0,RAND(),DATE_ADD('1970-01-01', INTERVAL UNIX_TIMESTAMP() SECOND)+0,703037144,180446,'wikitext'
Input #3 example: (with escaped ' or ", and a literal newline inside a value)
INSERT INTO text (old_id,old_text,old_flags) VALUES (631144794,'#REDIRECT [[Computer accessibility]]\n\n{{Redr|move|from CamelCase|up}}','utf-8'),(703037144,'{{Redirect2|Anarchist|Anarchists
|the fictional character|Anarchist (comics)|other uses|Anarchists (disambiguation)}}\n{{pp-move-indef}}\n{{Use British English|date=January 2014}}','utf-8');
Output #3 example:
631144794,'#REDIRECT [[Computer accessibility]]\n\n{{Redr|move|from CamelCase|up}}','utf-8'
703037144,'{{Redirect2|Anarchist|Anarchists|the fictional character|Anarchist (comics)|other uses|Anarchists (disambiguation)}}\n{{pp-move-indef}}\n{{Use British English|date=January 2014}}','utf-8'
Edit: after some more research, it appears that examples #2 and #3 cannot be handled by regular expressions alone; sources: #1, #2.
Upvotes: 0
Views: 353
Reputation: 203502
If this isn't what you want:
$ awk -v FPAT='[(][^)]+[)]' '{for (i=2;i<=NF;i++) print substr($i,2,length($i)-2)}' file
id1,'text1','flags1'
id2,'text2','flags2'
idN,'textN','flagsN'
then edit your question to provide clearer, testable sample input and expected output.
The above uses GNU awk for FPAT
; with other awks you'd use a while(match())
loop.
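For reference, a sketch of that while(match()) variant for POSIX awks, under the same assumption as the FPAT version (no ')' inside any value, so it handles example #1 but not #2 or #3):

```shell
# Sample input: one multi-row INSERT on a single line.
printf '%s\n' "INSERT INTO text (old_id,old_text,old_flags) VALUES (id1,'text1','flags1'),(id2,'text2','flags2');" > sample.sql
awk '{
  s = $0; first = 1
  while (match(s, /\([^)]+\)/)) {            # next "(...)" group
    if (!first)                              # skip the leading column list
      print substr(s, RSTART + 1, RLENGTH - 2)
    first = 0
    s = substr(s, RSTART + RLENGTH)
  }
}' sample.sql > sample.txt
cat sample.txt
# -> id1,'text1','flags1'
# -> id2,'text2','flags2'
```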
Upvotes: 1
Reputation: 2783
Use this (GNU sed, since the replacement text uses \n):
sed -e 's/^[^(]*([^)]*) VALUES (//' -e 's/);$//' -e 's/),(/\n/g' text.sql
(appropriately piped) and each row ends up on its own line.
Note this still fails when a value spans multiple lines (your example #3) or itself contains ),( — line-oriented tools cannot tell such a value apart from a row separator.
Regards
Upvotes: 0