chickpeaze
chickpeaze

Reputation: 11

Writing csv from tuple (of strings) in Python

To understand my current problem, here is some background info on the broader problem:

I have a large text file composed of multiple documents. I needed to find a way to organize this file into its constituent components. Unfortunately, all of the individual documents had different formats, and the only commonality among them was the fact that the head of each document included the date, written in the same format each time: dd MONTH yyyy. I used the dates as bookends to isolate the text in between them.

#the date pattern with positive lookbehind
bookend_1 = "(?<=\d{1,2}\sJANUARY\s\d{4}|\d{1,2}\sFEBRUARY\s\d{4}|\d{1,2}\sMARCH\s\d{4}|\d{1,2}\sAPRIL\s\d{4}|\d{1,2}\sMAY\s\d{4}|\d{1,2}\sJUNE\s\d{4}|\d{1,2}\sJULY\s\d{4}|\d{1,2}\sAUGUST\s\d{4}|\d{1,2}\sSEPTEMBER\s\d{4}|\d{1,2}\sOCTOBER\s\d{4}|\d{1,2}\sNOVEMBER\s\d{4}|\d{1,2}\sDECEMBER\s\d)"

#The date pattern with positive lookahead
bookend_2 = "(?=\d{1,2}\sJANUARY\s\d{4}|\d{1,2}\sFEBRUARY\s\d{4}|\d{1,2}\sMARCH\s\d{4}|\d{1,2}\sAPRIL\s\d{4}|\d{1,2}\sMAY\s\d{4}|\d{1,2}\sJUNE\s\d{4}|\d{1,2}\sJULY\s\d{4}|\d{1,2}\sAUGUST\s\d{4}|\d{1,2}\sSEPTEMBER\s\d{4}|\d{1,2}\sOCTOBER\s\d{4}|\d{1,2}\sNOVEMBER\s\d{4}|\d{1,2}\sDECEMBER\s\d)"

#using the bookends to find the text in between dates
docs = regex.findall(bookend_1+'(.*?)'+ bookend_2, psc_comm_raw, re.DOTALL|re.MULTILINE) 

Using regular expressions, I created two lists: one of all the dates, one of all the text passages that occured between the dates. I zipped these lists into a tuple. I couldn't zip them into a dictionary because some of the dates are repeated.

psc_comm_tuple = list(zip(date, docs))

Here are a few lines of psc_comm_tuple.

[('27 JULY 2004',
  ' ADDIS ABABA, ETHIOPIA\n\nPSC/PR/Comm.(XIII)\n\nCOMMUNIQUÉ\n\nPSC/PR/Comm.(XIII) Page l\n\nCOMMUNIQUÉ OF THE THIRTEENTH MEETING OF THE PEACE AND SECURITY COUNCIL\n\nThe Peace and Security Council (PSC) of the African Union (AU), at its thirteenth meeting, held on 27 July 2004, adopted the following communiqué on the crisis in the Darfur region of the Sudan:\n\nCouncil,\n\n1.\tReiterates its deep concern over the grave situation that still prevails in the Darfur region of the Sudan, in particular the continued attacks by the Janjaweed militia against the civilian population, as well as other human rights abuses and the humanitarian crisis;\n\n2.\tUnderlines the urgent need to implement decision AU/Dec.54(111) on Darfur, adopted by the 3rd Ordinary Session of the Assembly...'),
 ('29 JANUARY 2001',
  '\n\nThe Central Organ of the OAU Mechanism for Conflict Prevention, Management and Resolution held its seventy-third * ordinary session at the level of Ambassadors on 29 January 2001, in Addis Ababa. The session was chaired by Ambassador Kati Ohara Korga, Permanent Representative of Togo to the OAU.\n\nHaving considered the Report of the Secretary General on the Democratic Republic of the Congo (DRC) and the situation in that country, the Central Organ:\n\n1.\tstrongly condemns the assassination of Pre...'),
 ('20 MARCH 2001',
  "\n\nThe Central Organ of the OAU Mechanism for Conflict Prevention, Management and Resolution held its 74th ordinary session at ambassadorial level, in Addis Ababa, Ethiopia, on Tuesday March 20, 2001. The session was chaired by Ambassador Ohara Korga, Permanent representative of Togo to the OAU....'),
 ('22 AUGUST 2001',
  '\n\nThe Central Organ of the OAU Mechanism for Conflict Prevention, Management and Resolution held its 75th Ordinary Session at Ambassadorial level in Addis Ababa, Ethiopia, on Wednesday 22 August 2001....')...]

My final goal is to create a CSV with two columns: one for dates, the other for the body of text associated with that date.

import csv
import os

with open('psc_comm.csv','w') as out:
    csv_out=csv.writer(out)
    csv_out.writerow(['date','text'])
    for row in psc_comm_tuple:
        csv_out.writerow(row)

When I write the tuple output to a csv, some of the rows are totally fine. But some of the output goes haywire - the text is broken into seemingly random chunks, and there are blank rows, rows of sentence fragments. There are hundreds of these occurrences. When I look back at the original document and find the corresponding places where the sentences are breaking, I can't see anything special or unique about the text itself. There aren't special characters. It's just plain text. However, they do seem to be particularly long sections of text, so I'm wondering if there is a limit to how much information a single cell in a CSV file can contain.

My question is: why is the CSV output so funky in some places but not others? Is there a limit of how much text you can put into each cell?

Image of csv file

Upvotes: 1

Views: 143

Answers (1)

alexis
alexis

Reputation: 50200

You don't give enough information to identify the problem, but Excel tends to have problems with reading CSV cells with embedded newlines, so my first guess is that that's where it's going wrong: You have a csv with embedded newlines, which csvwriter might be writing in a reversible way but Excel isn't parsing properly.

In other words, it may be that there's nothing wrong with your CSV file; just with the way it's being read into Excel. You don't say how you establish that there's a problem.

If your goal is to generate something that Excel can read, I'd dispense with the CSV format and go straight for a spreadsheet. The module openpyxl can generate xlsx documents, and works great.

Upvotes: 1

Related Questions