seamen33

Reputation: 3

Python: Split text by keyword into Excel rows

I'm new to programming and have found a lot of helpful threads already, but none quite covers what I need.
I have one text file that looks like:

  1 of 5000 DOCUMENTS


                    Copyright 2010 The Deal, L.L.C.
                          All Rights Reserved
                          Daily Deal/The Deal

                        January 12, 2010 Tuesday

HEADLINE: Cadbury slams Kraft bid

BODY:

  On cue .....

......

body of article here

......

DEAL SIZE

$ 10-50 Billion

                            2 of 5000 DOCUMENTS


                    Copyright 2015 The Deal, L.L.C.
                          All Rights Reserved
                           The Deal Pipeline

                      September 17, 2015 Thursday

HEADLINE: Perrigo rejects formal offer from Mylan

BODY: 
(and here again the body of this article)

DEAL SIZE

As output, I would like just the body of each article in its own row (one cell per article body) in a single file. I have around 5000 articles to process this way, so the output would be 5000 rows and 1 column. From what I could find, the re module seems to be the best fit. The recurring keywords are BODY: and perhaps DOCUMENTS. How do I extract just the text between those keywords into a new Excel row for every article?

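import re
# Raw string so that '\t' in the path is not read as a tab character
with open(r'F:\text.txt', 'r') as txt:
    inputtext = txt.read()  # split the file's contents, not the path string
sections = re.split(r'\n(?=BODY:)', inputtext)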

or something like this?

section = []
for line in open_file_object:
    if line.startswith('BODY:'):
        # new section
        if section:
            process_section(section)
        section = [line]
    else:
        section.append(line)
if section:
    process_section(section)

I'm a bit lost as to where to look. Thanks in advance!

EDIT: Thanks to ewwink I'm currently here:

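import re

# Raw string so the backslashes in the Windows path are taken literally
path = r'F:\CloudStation\Bocconi University\MSc. Thesis\test folder\majortest.txt'
with open(path, 'r') as txt:
    inputtext = txt.read()

# Stop each body at the next "N of M DOCUMENTS" header, or at the end of the file
articlesBody = re.findall(r'BODY:(.+?)(?=\d+\sof\s\d+\sDOCUMENTS|\Z)', inputtext, re.S)

#print(articlesBody)
#print(type(articlesBody))

with open('result.csv', 'w') as out:
    for item in articlesBody:
        item = item.replace('\n', ' ').replace('"', '""')
        out.write('"%s"\n' % item)  # one quoted cell per line, i.e. one row per article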

Upvotes: 0

Views: 516

Answers (1)

ewwink

Reputation: 19164

To work with a file, use with open(r'F:\text.txt', mode), where mode is 'r' for reading or 'w' for writing. To extract the content, use re.findall. Finally, you need to escape newlines (\n), double quotes ("), and possibly other characters before writing them to the CSV.

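import re

articlesBody = None
with open('text.txt', 'r') as txt:
    inputtext = txt.read()
    # \d+ handles multi-digit article numbers; |\Z also captures the last article
    articlesBody = re.findall(r'BODY:(.+?)(?=\d+\sof\s5000|\Z)', inputtext, re.S)

#print(articlesBody)

with open('result.csv', 'w') as out:  # 'out' avoids shadowing the csv module name
    for item in articlesBody:
        # escape newlines and double quotes so each body stays in one cell
        item = item.replace('\n', '\\n').replace('"', '""')
        out.write('"%s"\n' % item)  # the newline ends the row: one article per row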

Another note: test it on a small sample of the file first.
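For a quick check on small content, here is a minimal sketch of the same regex idea that lets the standard csv module do the quoting instead of escaping by hand. The SAMPLE text and the helper names extract_bodies and write_rows are made up for illustration, and it assumes every article header has the form "N of M DOCUMENTS".

```python
import csv
import io
import re

# Hypothetical sample mimicking the layout in the question (two short articles).
SAMPLE = """  1 of 5000 DOCUMENTS

HEADLINE: Cadbury slams Kraft bid

BODY:

  On cue the board said "no".
Second line.

DEAL SIZE

$ 10-50 Billion

                            2 of 5000 DOCUMENTS

HEADLINE: Perrigo rejects formal offer from Mylan

BODY:
Perrigo said no as well.

DEAL SIZE
"""

# Capture everything after BODY: up to the next "N of M DOCUMENTS" header,
# or up to the end of the text for the last article.
BODY_RE = re.compile(r'BODY:(.+?)(?=\d+\s+of\s+\d+\s+DOCUMENTS|\Z)', re.S)


def extract_bodies(text):
    """Return one whitespace-normalised string per article body."""
    return [' '.join(body.split()) for body in BODY_RE.findall(text)]


def write_rows(bodies, handle):
    """Write each body as a single-cell row; csv.writer handles the quoting."""
    writer = csv.writer(handle)
    for body in bodies:
        writer.writerow([body])


bodies = extract_bodies(SAMPLE)
buffer = io.StringIO()  # stands in for a real file while testing
write_rows(bodies, buffer)
print(buffer.getvalue())
```

When writing to a real file, open it with open('result.csv', 'w', newline='') as the csv documentation recommends, and pass that handle to write_rows. Note that the trailing DEAL SIZE block is still part of each captured body here; if it should be dropped, the pattern would need to stop at DEAL SIZE instead.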

Upvotes: 1
