Parse a very large text file with Python?

Question

The file has about 57,000 entries, each with a book title, author names, and an ETEXT No. I am trying to parse the file to get only the ETEXT Nos.

The file looks like this:

TITLE and AUTHOR                                                     ETEXT NO.

Aspects of plant life; with special reference to the British flora,      56900
 by Robert Lloyd Praeger

The Vicar of Morwenstow, by Sabine Baring-Gould                          56899
 [Subtitle: Being a Life of Robert Stephen Hawker, M.A.]

Raamatun tutkisteluja IV, mennessä Charles T. Russell                    56898
 [Subtitle: Harmagedonin taistelu]
 [Language: Finnish]

Raamatun tutkisteluja III, mennessä Charles T. Russell                   56897
 [Subtitle: Tulkoon valtakuntasi]
 [Language: Finnish]

Tom Thatcher's Fortune, by Horatio Alger, Jr.                            56896

A Yankee Flier in the Far East, by Al Avery                              56895
 and George Rutherford Montgomery
 [Illustrator: Paul Laune]

Nancy Brandon's Mystery, by Lillian Garis                                56894

Nervous Ills, by Boris Sidis                                             56893
 [Subtitle: Their Cause and Cure]

Pensées sans langage, par Francis Picabia                                56892
 [Language: French]

Helon's Pilgrimage to Jerusalem, Volume 2 of 2, by Frederick Strauss     56891
 [Subtitle: A picture of Judaism, in the century
  which preceded the advent of our Savior]

Fra Tommaso Campanella, Vol. 1, di Luigi Amabile                         56890
 [Subtitle: la sua congiura, i suoi processi e la sua pazzia]
 [Language: Italian]

The Blue Star, by Fletcher Pratt                                         56889

Importanza e risultati degli incrociamenti in avicoltura,                56888
 di Teodoro Pascal
 [Language: Italian]

And this is what I tried:

def search_by_etext():

    fhand = open('GUTINDEX.ALL')
    print("Search by ETEXT:")

    for line in fhand:
        if not line.startswith(" [") and not line.startswith("~"):
            if not line.startswith(" ") and not line.startswith("TITLE"):
                    words = line.rstrip()
                    words = line.lstrip()
                    words = words[-7:]
                    print (words)


search_by_etext()

The code mostly works. However, for some lines it gives me part of the title or other text. For example, the output contains 'decott', which is part of an author's name and shouldn't be there.

For this input:

The Bashful Earthquake, by Oliver Herford 56765 [Subtitle: and Other Fables and Verses]

The House of Orchids and Other Poems, by George Sterling 56764

North Italian Folk, by Alice Vansittart Strettel Carr 56763 and Randolph Caldecott [Subtitle: Sketches of Town and Country Life]

Wild Life in New Zealand. Part 1, Mammalia, by George M. Thomson 56762 [Subtitle: New Zealand Board of Science and Art, Manual No. 2]

Universal Brotherhood, Volume 13, No. 10, January 1899, by Various 56761

De drie steden: Lourdes, door Émile Zola 56760 [Language: Dutch]

Another example:

For Rhandensche Jongens, door Jan Lens 56702 [Illustrator: Tjeerd Bottema] [Language: Dutch]

The Story of The Woman's Party, by Inez Haynes Irwin 56701

Mormon Doctrine Plain and Simple, by Charles W. Penrose 56700 [Subtitle: Or Leaves from the Tree of Life]

The Stone Axe of Burkamukk, by Mary Grant Bruce 56699 [Illustrator: J. Macfarlane]

The Latter-Day Prophet, by George Q. Cannon 56698 [Subtitle: History of Joseph Smith Written for Young People]

Here, 'Life]' shouldn't be there. Lines starting with a blank space should have been filtered out by this:

if not line.startswith(" [") and not line.startswith("~"):

But I am still getting those stray values in my output.

brabster · Accepted Answer

It could be that those extra lines that are not being filtered out start with whitespace other than a " " char, like a tab for example. As a minimal change that might work, try filtering out lines that start with any whitespace rather than specifically a space char?
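One way to test this guess before changing the filter is to look at the repr() of the lines that begin with whitespace other than a plain space. The sketch below is only illustrative: the function name odd_leading_whitespace is hypothetical, and the sample lines are made up to mimic the file (they are not copied from GUTINDEX.ALL).

```python
def odd_leading_whitespace(lines):
    """Return (line_number, repr_of_prefix) for lines whose first character
    is whitespace but not an ordinary space or a line ending."""
    hits = []
    for number, line in enumerate(lines, start=1):
        first = line[:1]
        if first.isspace() and first not in (" ", "\n", "\r"):
            hits.append((number, repr(line[:12])))
    return hits


# Made-up sample: line 2 starts with a tab and line 3 with a non-breaking
# space (U+00A0). Neither is caught by line.startswith(" ").
sample = [
    "North Italian Folk, by Alice Vansittart Strettel Carr   56763\n",
    "\tand Randolph Caldecott\n",
    "\u00a0[Subtitle: Sketches of Town and Country Life]\n",
    " [Language: Dutch]\n",
]

for number, prefix in odd_leading_whitespace(sample):
    print(number, prefix)
```

If running this over the real file reports any lines, their repr() shows exactly which character the startswith(" ") test is missing.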

To check for whitespace in general rather than a space char, you can use a regular expression (this needs import re at the top of the script). Try if not re.match(r'^\s', line) and ...
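Put together, the suggestion might look like the sketch below. It is a minimal sketch, not a tested fix for the real index: the function name etext_numbers is hypothetical, the sample lines are made up, and it assumes that every main entry line ends with an all-digit ETEXT number.

```python
import re


def etext_numbers(lines):
    """Collect the trailing ETEXT number from each main entry line."""
    numbers = []
    for line in lines:
        # Skip continuation lines that start with ANY whitespace (space,
        # tab, non-breaking space, ...), plus the header and "~" lines.
        if re.match(r"\s", line) or line.startswith(("TITLE", "~")):
            continue
        fields = line.split()
        # Assumption: the ETEXT number is the last field and is all digits.
        if fields and fields[-1].isdigit():
            numbers.append(fields[-1])
    return numbers


# Made-up sample in the same layout as the index file.
sample = [
    "TITLE and AUTHOR                                    ETEXT NO.\n",
    "\n",
    "North Italian Folk, by Alice Vansittart Strettel Carr   56763\n",
    "\tand Randolph Caldecott\n",
    " [Subtitle: Sketches of Town and Country Life]\n",
    "\n",
    "De drie steden: Lourdes, door Émile Zola                56760\n",
    " [Language: Dutch]\n",
]

print(etext_numbers(sample))  # ['56763', '56760']
```

Taking the last whitespace-separated field instead of slicing the last seven characters also avoids picking up the trailing newline or part of a name when the number is not exactly where the slice expects it.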
