Reputation: 688
I'm trying to get the title of some text files with this code:
    for line in content:
        title = re.search('^Title:(.*)$', line)
        if title:
            return title.group(1)
For these texts:
DOCA.TXT:
Title: Brown Corpus: Part A
But/cc the/at seven-iron/nn shot/nn he/pps used/vbd to/to approach/vb the/at green/nn strayed/vbd into/in a/at bunker/nn and/cc lodged/vbd in/in a/at slight/jj depression/nn ./.
End
DOCB.TXT:
Title: The Brown Corpus
The Brown University Standard Corpus of Present-Day American English (or just Brown Corpus) was compiled ...
Somehow I can only get the title of DOCB.TXT, but not DOCA.TXT (it shows "None").
Upvotes: 1
Views: 85
Reputation: 951
The code below works for me. Perhaps you need to look at your files in a hex editor: it might be an end-of-line issue or a Unicode byte order mark (BOM), if you're not telling Python how to open the file to handle that.
    #!python3.4
    import re

    for fn in ('a.txt', 'b.txt'):
        with open(fn) as fin:
            for line in fin:
                title = re.search('^Title:(.*)$', line)
                if title:
                    print(title.group(1))
To open a UTF-16 file with a BOM, you would use something like: open(fn, encoding='utf-16')
Of course, this won't work for an ASCII, ISO-8859-1 or UTF-8 encoded file, so you would need to make sure all text files are in the same encoding or pick the right encoding for each text file.
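If the files turn out to be in mixed encodings, one way to pick the right encoding per file is to look at the first bytes for a BOM. This is only a rough sketch: `sniff_encoding` and `find_title` are hypothetical helper names, and falling back to UTF-8 when there is no BOM is an assumption (an ISO-8859-1 file with non-ASCII bytes would still fail to decode). It also shows why the original regex can miss the first line: a leftover BOM character sits in front of "Title:", so `^Title:` does not match.

```python
import codecs
import re


def sniff_encoding(raw, default="utf-8"):
    """Guess an encoding from a leading BOM (hypothetical helper).

    UTF-32 is checked first because the UTF-32 LE BOM begins with the
    same two bytes as the UTF-16 LE BOM.
    """
    if raw.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
        return "utf-32"
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    if raw.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"  # decodes UTF-8 and drops the BOM
    return default  # assumption: files without a BOM are UTF-8/ASCII


def find_title(raw):
    """Return the text after 'Title:' from raw file bytes, or None."""
    text = raw.decode(sniff_encoding(raw))
    # splitlines() also handles Windows '\r\n' line endings
    for line in text.splitlines():
        match = re.search(r'^Title:(.*)$', line)
        if match:
            return match.group(1).strip()
    return None


# A BOM left in the decoded text stops '^Title:' from matching:
assert re.search(r'^Title:(.*)$', '\ufeffTitle: Brown Corpus: Part A') is None
```

You would call it with the file's bytes, e.g. `find_title(open(fn, 'rb').read())`. Checking the files in a hex editor first is still the more reliable way to find out what is actually in them.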
Upvotes: 2