Reputation: 2876
I need to import content from WordPress into Plone, a Python-based CMS, and I have a dump of the posts table as a huge plain CSV file using ";" as the delimiter.
The problem is that the standard reader from the csv module is not smart enough to parse the HTML content inside a row (the post_content field).
For instance, when the parser encounters something like <p>&nbsp;</p>, it interprets the semicolon as a field delimiter, and I end up with more items than fields and with fields holding the wrong content.
Is there any other way to solve this kind of issue? Processing each row with a regex seems pretty scary to me.
Upvotes: 2
Views: 891
Reputation: 2876
Another option, for smaller sites, could be pywordpress, a Pythonic interface to the WordPress XML-RPC API.
Upvotes: 1
Reputation: 2876
After some additional research, I discovered the excel-tab dialect by reading the text of PEP 305 (which proposed adding the csv module to Python); it is mentioned in the module documentation, but I hadn't noticed it at first.
I then re-exported the posts using a tab (\t) as the delimiter.
I ran a test reading a batch of 1,000 rows and found no errors at all.
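For reference, a minimal sketch of reading such a tab-delimited export with the built-in excel-tab dialect. The sample data, the column names other than post_content, and the read_posts helper are made up for illustration; the real export would be opened from a file instead.

```python
import csv
import io

# Hypothetical two-line sample standing in for the re-exported posts table.
sample = "ID\tpost_title\tpost_content\n1\tHello\t<p>&nbsp;</p>\n"


def read_posts(stream):
    """Parse a tab-delimited stream using the built-in excel-tab dialect."""
    return list(csv.reader(stream, dialect="excel-tab"))


# newline="" leaves line endings untouched so the csv module can handle them.
rows = read_posts(io.StringIO(sample, newline=""))

for row in rows:
    print(len(row), row)
```

With a real file, something like open(path, newline="", encoding="utf-8") would replace the StringIO object. Semicolons inside the HTML are no longer a problem, but a post containing a literal tab or line break could still split a row unless the export quotes its fields.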
Upvotes: 2
Reputation: 3369
The csv module provides the escapechar format parameter, which lets you escape the delimiter (which you have set to a semicolon). If you can pass escapechar='\\' in the call to csv.reader(), you could then replace each \ in your CSV file with \\, and replace &nbsp; with &nbsp\; (using a text editor's find/replace option).
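A small sketch of what the reader call might look like once the file has been escaped this way. The sample row is hypothetical, and it assumes the find/replace has already been applied to the data.

```python
import csv
import io

# Hypothetical row after the find/replace: the semicolon inside the HTML
# entity is now preceded by a backslash. In the file this reads:
#   1;Hello;<p>&nbsp\;</p>
escaped = "1;Hello;<p>&nbsp\\;</p>\n"

reader = csv.reader(
    io.StringIO(escaped, newline=""),
    delimiter=";",
    escapechar="\\",  # a backslash removes the special meaning of the next character
)
row = next(reader)
print(row)
```

The escaped semicolon stays inside the third field instead of starting a new one, and the backslash itself is dropped from the parsed value.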
Upvotes: 1