Reputation: 3
I'm new to programming and have found a lot of helpful threads already, but not quite what I need.
I have one text file that looks like:
1 of 5000 DOCUMENTS
Copyright 2010 The Deal, L.L.C.
All Rights Reserved
Daily Deal/The Deal
January 12, 2010 Tuesday
HEADLINE: Cadbury slams Kraft bid
BODY:
On cue .....
......
body of article here
......
DEAL SIZE
$ 10-50 Billion
2 of 5000 DOCUMENTS
Copyright 2015 The Deal, L.L.C.
All Rights Reserved
The Deal Pipeline
September 17, 2015 Thursday
HEADLINE: Perrigo rejects formal offer from Mylan
BODY:
(and here again the body of this article)
DEAL SIZE
As output I would like JUST the body of every article, each in a new row (one cell per article body), in one file. I have around 5000 articles to process like this, so the output would be 5000 rows and 1 column. From what I could find, 're' seems to be the best solution. The recurring keywords are BODY: and perhaps DOCUMENTS. How do I extract just the text between those keywords into a new row in Excel for every article?
import re
inputtext = 'F:\text.txt'
re.split(r'\n(?=BODY:)', inputtext)
or something like this?
section = []
for line in open_file_object:
    if line.startswith('BODY:'):
        # new section
        if section:
            process_section(section)
        section = [line]
    else:
        section.append(line)
if section:
    process_section(section)
I'm a bit lost in where to look, thanks in advance!
EDIT: Thanks to ewwink I'm currently here:
import re
articlesBody = None
with open('F:\CloudStation\Bocconi University\MSc. Thesis\\test folder\majortest.txt', 'r') as txt:
    inputtext = txt.read()
    articlesBody = re.findall(r'BODY:(.+?)\d\sDOCUMENTS', inputtext, re.S)
#print(articlesBody)
#print(type(articlesBody))
with open('result.csv', 'w') as csv:
    for item in articlesBody:
        item = item.replace('\n', ' ')
        csv.write('"%s",' % item)
Upvotes: 0
Views: 516
Reputation: 19164
When working with a file, use with open(r'F:\text.txt', mode), where mode is 'r' for reading or 'w' for writing. To extract the content, use re.findall. Finally, you need to escape newlines (\n), double quotes ("), and possibly other characters.
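import re
articlesBody = None
with open('text.txt', 'r') as txt:
    inputtext = txt.read()
    # Capture everything after "BODY:" up to the next "N of 5000" header.
    # \d+ allows multi-digit article numbers. The last article has no
    # following header, so this pattern does not capture it.
    articlesBody = re.findall(r'BODY:(.+?)\d+\sof\s5000', inputtext, re.S)
#print(articlesBody)
with open('result.csv', 'w') as csv:
    for item in articlesBody:
        # Escape newlines and double quotes, then write one quoted cell per row.
        item = item.replace('\n', '\\n').replace('"', '""')
        csv.write('"%s"\n' % item)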
Another note: try it on a small sample first.
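As a possible refinement, here is a minimal sketch that lets Python's csv module do the quoting and writes one row per article. It rests on assumptions taken from the sample in the question: every article starts with a line like "12 of 5000 DOCUMENTS", and the body ends at a "DEAL SIZE" line, at the next such header, or at the end of the file. The names BODY_RE, extract_bodies, write_rows and SAMPLE are made up for illustration, and the sample text is invented for a quick test run.

```python
import csv
import io
import re

# Assumed layout: the body runs from "BODY:" to a "DEAL SIZE" line,
# the next "N of M DOCUMENTS" header line, or the end of the text.
BODY_RE = re.compile(
    r'^BODY:\s*(.*?)(?=^DEAL SIZE\s*$|^\d+ of \d+ DOCUMENTS\s*$|\Z)',
    re.S | re.M,
)


def extract_bodies(text):
    """Return one string per article body, with runs of whitespace collapsed."""
    return [' '.join(body.split()) for body in BODY_RE.findall(text)]


def write_rows(bodies, out):
    """Write each body as its own single-column CSV row to an open file object."""
    writer = csv.writer(out)  # quotes commas and double quotes as needed
    for body in bodies:
        writer.writerow([body])


# Small invented sample for a first test run (not taken from the real file).
SAMPLE = '''1 of 12 DOCUMENTS
HEADLINE: First headline
BODY:
First body, line one.
Line "two".
DEAL SIZE
$ 10-50 Billion
12 of 12 DOCUMENTS
HEADLINE: Second headline
BODY:
Second body.
'''

bodies = extract_bodies(SAMPLE)
buffer = io.StringIO()
write_rows(bodies, buffer)
print(buffer.getvalue())
```

For the real data, read the input file into a string, pass it to extract_bodies, and hand write_rows a file opened with open('result.csv', 'w', newline=''), which is how the csv documentation recommends opening output files. If the real file does not follow the assumed layout, the pattern will need adjusting.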
Upvotes: 1