Python: Split text by keyword into Excel rows

Question

I'm new to programming and have found a lot of helpful threads already, but none that quite covers what I need.
I have one text file that looks like this:

  1 of 5000 DOCUMENTS


                    Copyright 2010 The Deal, L.L.C.
                          All Rights Reserved
                          Daily Deal/The Deal

                        January 12, 2010 Tuesday

HEADLINE: Cadbury slams Kraft bid

BODY:

  On cue .....

......

body of article here

......

DEAL SIZE

$ 10-50 Billion

                            2 of 5000 DOCUMENTS


                    Copyright 2015 The Deal, L.L.C.
                          All Rights Reserved
                           The Deal Pipeline

                      September 17, 2015 Thursday

HEADLINE: Perrigo rejects formal offer from Mylan

BODY: 
(and here again the body of this article)

DEAL SIZE

As output I would like JUST the body of every article, each in a new row (one cell per article body), all in one file. I have around 5000 articles to process like this, so the output would be 5000 rows and 1 column. From what I could find, 're' seems to be the best solution. The recurring keywords are BODY: and perhaps DOCUMENTS. How do I extract just the text between those keywords into a new row in Excel for every article?

import re
inputtext = 'F:\text.txt'
re.split(r'\n(?=BODY:)', inputtext)

or something like this?

section = []
for line in open_file_object:
    if line.startswith('BODY:'):
        # new section
        if section:
            process_section(section)
        section = [line]
    else:
        section.append(line)
if section:
    process_section(section)

I'm a bit lost in where to look, thanks in advance!

EDIT: Thanks to ewwink I'm currently here:

import re
articlesBody = None
with open('F:\CloudStation\Bocconi University\MSc. Thesis\test folder\majortest.txt', 'r') as txt:
  inputtext = txt.read()
  articlesBody = re.findall(r'BODY:(.+?)\d\sDOCUMENTS', inputtext, re.S)

#print(articlesBody)
#print(type(articlesBody))

  with open('result.csv', 'w') as csv:
   for item in articlesBody:
    item = item.replace('\n', ' ')
    csv.write('"%s",' % item)

ewwink · Accepted Answer

To work with a file, use with open('F:\text.txt', mode), where mode is 'r' for reading or 'w' for writing. To extract the content, use re.findall. Finally, you need to escape newlines, double quotes ("), and possibly other characters.

import re

articlesBody = None
with open('text.txt', 'r') as txt:
  inputtext = txt.read()
  articlesBody = re.findall(r'BODY:(.+?)\d\sof\s5000', inputtext, re.S)

#print(articlesBody)

with open('result.csv', 'w') as csv:
  for item in articlesBody:
    item = item.replace('\n', '\\n').replace('"', '""')
    csv.write('"%s",' % item)

Another note: try it on a small sample of the content first.
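
As a possible variation on the same idea, here is a minimal sketch that lets Python's csv module handle the quoting and writes one body per row instead of one long comma-separated line. It assumes every article starts with a header line of the form "<n> of <total> DOCUMENTS" and that BODY: starts a line; the names DOC_HEADER, BODY_RE, extract_bodies and write_rows are made up for illustration. Like the code above, it keeps everything between BODY: and the next header, so the trailing DEAL SIZE lines stay in the cell.

```python
import csv
import re

# Assumed header format: "<n> of <total> DOCUMENTS" alone on a line.
DOC_HEADER = r'^[ \t]*\d+ of \d+ DOCUMENTS[ \t]*$'

# Capture everything after "BODY:" up to the next document header,
# or up to the end of the input for the last article.
BODY_RE = re.compile(r'^BODY:(.*?)(?=' + DOC_HEADER + r'|\Z)', re.S | re.M)


def extract_bodies(text):
    """Return the article bodies, with runs of whitespace collapsed to one space."""
    return [' '.join(body.split()) for body in BODY_RE.findall(text)]


def write_rows(bodies, path):
    """Write one body per row; csv.writer takes care of quotes and commas."""
    with open(path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        for body in bodies:
            writer.writerow([body])


# Small made-up sample in the same layout as the question's file.
sample = """  1 of 5000 DOCUMENTS

HEADLINE: Cadbury slams Kraft bid

BODY:

  On cue ...

DEAL SIZE

$ 10-50 Billion

                            2 of 5000 DOCUMENTS

HEADLINE: Perrigo rejects formal offer from Mylan

BODY: 
second "quoted" body
"""

bodies = extract_bodies(sample)
```

For the real file, something like write_rows(extract_bodies(inputtext), 'result.csv') should give a CSV with one article per row that Excel can open.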
