Getting Title of a text

Question

I'm trying to get the title of some text files with this code:

for line in content:
    title = re.search('^Title:(.*)$',line)
    if title:
        return(title.group(1))

For these texts:

DOCA.TXT:

Title: Brown Corpus: Part A

But/cc the/at seven-iron/nn shot/nn he/pps used/vbd to/to approach/vb the/at green/nn strayed/vbd into/in a/at bunker/nn and/cc lodged/vbd in/in a/at slight/jj depression/nn ./.

End

DOCB.TXT:

Title: The Brown Corpus

The Brown University Standard Corpus of Present-Day American English (or just Brown Corpus) was compiled ...

Somehow I can only get the title of DOCB.TXT, but not DOCA.TXT (it shows "None").

Adam Kerz · Accepted Answer

The code below works for me. Perhaps you need to look at your files in a hex editor: it might be an end-of-line issue or a Unicode byte order mark (BOM) that Python is not handling because you are not telling it how to open the file.

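#!python3.4
import re

for fn in ('a.txt','b.txt'):
    # Opened with the platform default encoding, since none is given.
    with open(fn) as fin:
        # Iterate over the file line by line.
        for line in fin:
            # Raw string so the regex is passed to re unchanged.
            title = re.search(r'^Title:(.*)$',line)
            if title:
                print(title.group(1))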
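If a hex editor is not at hand, the first few bytes can also be checked from Python by opening the file in binary mode. This is only a rough sketch; first_bytes is a made-up helper name and 16 is an arbitrary number of bytes to look at.

```python
def first_bytes(path, count=16):
    """Return the first `count` raw bytes of a file, without decoding them."""
    # Binary mode ("rb") bypasses any encoding and newline translation.
    with open(path, "rb") as fin:
        return fin.read(count)

# What to look for in the result:
#   starts with b'\xef\xbb\xbf'           -> UTF-8 byte order mark
#   starts with b'\xff\xfe' or b'\xfe\xff' -> UTF-16 byte order mark
#   contains b'\r\n'                       -> Windows-style line endings
```

Calling print(first_bytes('a.txt')) and print(first_bytes('b.txt')) and comparing the two results should show whether the files really start with the same bytes.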

To open a UTF-16 file with a BOM, you would use something like: open(fn, encoding='utf-16')

Of course, this won't work for an ASCII, ISO 8859-1 or UTF-8 encoded file, so you would need to make sure all text files are in the same encoding or pick the right encoding for each text file.
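One possible way to pick the encoding per file is to look for a BOM before opening it as text. The sketch below is an illustration, not a general encoding detector: sniff_encoding and find_title are hypothetical names, the fallback of 'utf-8' for files without a BOM is an assumption (it also covers plain ASCII, but an ISO 8859-1 file has no BOM and cannot be recognised this way), and the .strip() on the result is an addition that removes the space after "Title:".

```python
import codecs
import re

# BOMs are checked longest first, so a UTF-32 LE file
# (b'\xff\xfe\x00\x00') is not mistaken for UTF-16 LE (b'\xff\xfe').
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),   # utf-8-sig drops the BOM on read
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]

def sniff_encoding(path, default="utf-8"):
    """Guess an encoding from the BOM; `default` is an assumed fallback."""
    with open(path, "rb") as fin:
        head = fin.read(4)
    for bom, name in BOMS:
        if head.startswith(bom):
            return name
    return default

def find_title(path):
    """Return the text after 'Title:' on the first matching line, or None."""
    with open(path, encoding=sniff_encoding(path)) as fin:
        for line in fin:
            match = re.search(r'^Title:(.*)$', line)
            if match:
                return match.group(1).strip()
    return None
```

With this, something like find_title('a.txt') should behave the same whether the file was saved as UTF-8, UTF-8 with a BOM, or UTF-16 with a BOM.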
