Azazel

Reputation: 167

Parse a very large text file with Python?

The file has about 57,000 book titles, author names, and ETEXT numbers. I am trying to parse the file to get only the ETEXT numbers.

The file looks like this:

TITLE and AUTHOR                                                     ETEXT NO.

Aspects of plant life; with special reference to the British flora,      56900
 by Robert Lloyd Praeger

The Vicar of Morwenstow, by Sabine Baring-Gould                          56899
 [Subtitle: Being a Life of Robert Stephen Hawker, M.A.]

Raamatun tutkisteluja IV, mennessä Charles T. Russell                    56898
 [Subtitle: Harmagedonin taistelu]
 [Language: Finnish]

Raamatun tutkisteluja III, mennessä Charles T. Russell                   56897
 [Subtitle: Tulkoon valtakuntasi]
 [Language: Finnish]

Tom Thatcher's Fortune, by Horatio Alger, Jr.                            56896

A Yankee Flier in the Far East, by Al Avery                              56895
 and George Rutherford Montgomery
 [Illustrator: Paul Laune]

Nancy Brandon's Mystery, by Lillian Garis                                56894

Nervous Ills, by Boris Sidis                                             56893
 [Subtitle: Their Cause and Cure]

Pensées sans langage, par Francis Picabia                                56892
 [Language: French]

Helon's Pilgrimage to Jerusalem, Volume 2 of 2, by Frederick Strauss     56891
 [Subtitle: A picture of Judaism, in the century
  which preceded the advent of our Savior]

Fra Tommaso Campanella, Vol. 1, di Luigi Amabile                         56890
 [Subtitle: la sua congiura, i suoi processi e la sua pazzia]
 [Language: Italian]

The Blue Star, by Fletcher Pratt                                         56889

Importanza e risultati degli incrociamenti in avicoltura,                56888
 di Teodoro Pascal
 [Language: Italian]

And this is what I tried:

def search_by_etext():

    fhand = open('GUTINDEX.ALL')
    print("Search by ETEXT:")

    for line in fhand:
        if not line.startswith(" [") and not line.startswith("~"):
            if not line.startswith(" ") and not line.startswith("TITLE"):
                    words = line.rstrip()
                    words = line.lstrip()
                    words = words[-7:]
                    print (words)


search_by_etext()

The code mostly works. However, for some lines it gives me part of the title or other things. For example, the output contains 'decott', which is part of an author name and shouldn't be there.

For this:

The Bashful Earthquake, by Oliver Herford                                56765
 [Subtitle: and Other Fables and Verses]

The House of Orchids and Other Poems, by George Sterling                 56764

North Italian Folk, by Alice Vansittart Strettel Carr                    56763
 and Randolph Caldecott
 [Subtitle: Sketches of Town and Country Life]

Wild Life in New Zealand. Part 1, Mammalia, by George M. Thomson         56762
 [Subtitle: New Zealand Board of Science and Art, Manual No. 2]

Universal Brotherhood, Volume 13, No. 10, January 1899, by Various       56761

De drie steden: Lourdes, door Émile Zola                                 56760
 [Language: Dutch]

Another example:

For:

Rhandensche Jongens, door Jan Lens                                       56702
 [Illustrator: Tjeerd Bottema]
 [Language: Dutch]

The Story of The Woman's Party, by Inez Haynes Irwin                     56701

Mormon Doctrine Plain and Simple, by Charles W. Penrose                  56700
 [Subtitle: Or Leaves from the Tree of Life]

The Stone Axe of Burkamukk, by Mary Grant Bruce                          56699
 [Illustrator: J. Macfarlane]

The Latter-Day Prophet, by George Q. Cannon                              56698
 [Subtitle: History of Joseph Smith Written for Young People]

Here, 'Life]' shouldn't be there. Lines starting with a blank space should have been filtered out by this:

if not line.startswith(" [") and not line.startswith("~"):

But I am still getting those stray values in my output.

Upvotes: 1

Views: 182

Answers (2)

bruno desthuilliers

Reputation: 77942

Simple solution: regexps to the rescue!

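import re

with open("etext.txt") as f:
    for line in f:
        # Look for a space followed by digits at the very end of the line.
        match = re.search(r" (\d+)$", line.strip())
        if match:
            # group(1) holds only the digits, without the leading space.
            print(match.group(1))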

The regular expression " (\d+)$" matches a space followed by one or more digits at the end of the string, and captures only the "one or more digits" group.
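As a quick check of the pattern used in the code above, here is a minimal sketch that runs it over a few lines copied from the question instead of a file. The names SAMPLE, ETEXT_RE and extract_etext_numbers are made up for this illustration.

```python
import re

# Hypothetical sample: a few lines copied from the question's listing.
SAMPLE = """\
TITLE and AUTHOR                                                     ETEXT NO.

The House of Orchids and Other Poems, by George Sterling                 56764

North Italian Folk, by Alice Vansittart Strettel Carr                    56763
 and Randolph Caldecott
 [Subtitle: Sketches of Town and Country Life]
"""

# A space, then one or more digits, anchored at the end of the string.
ETEXT_RE = re.compile(r" (\d+)$")


def extract_etext_numbers(lines):
    """Return the trailing number of every line that ends with one."""
    numbers = []
    for line in lines:
        match = ETEXT_RE.search(line.strip())
        if match:
            numbers.append(match.group(1))
    return numbers


print(extract_etext_numbers(SAMPLE.splitlines()))  # ['56764', '56763']
```

The header, the blank lines and the indented continuation lines are skipped simply because they do not end in digits, so no separate startswith filtering is needed for this sample.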

You can refine the regexp if needed: for example, if you know all ETEXT codes are exactly 5 digits long, you can change it to " (\d{5})$".
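To illustrate why the stricter pattern can help, here is a small sketch comparing the two. Both input lines are made up: the second imitates a wrapped continuation line that happens to end in a year, which may or may not occur in the real file.

```python
import re

LOOSE = re.compile(r" (\d+)$")     # any run of digits at the end
STRICT = re.compile(r" (\d{5})$")  # exactly five digits at the end

# Hypothetical lines, for illustration only.
title_line = "Universal Brotherhood, by Various                56761"
wrapped_line = " Volume 13, No. 10, January 1899"

for pattern in (LOOSE, STRICT):
    for line in (title_line, wrapped_line):
        match = pattern.search(line.strip())
        print(repr(pattern.pattern), "->", match.group(1) if match else None)
```

The loose pattern picks up 1899 from the wrapped line, while the strict one returns no match there. The trade-off is that the strict pattern would also skip any genuine code that is not exactly five digits long, so it only fits if that assumption holds for the whole file.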

This works with the example text you posted. If it doesn't work properly on your own file, then we need enough of the real data to find out what you really have.

Upvotes: 4

brabster

Reputation: 43600

It could be that those extra lines that are not being filtered out start with whitespace other than a " " char, like a tab for example. As a minimal change that might work, try filtering out lines that start with any whitespace rather than specifically a space char?

To check for whitespace in general rather than a space char, one option is a regular expression. Try if not re.match(r'^\s', line) and ...
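A minimal sketch of that check, assuming the stray lines begin with a tab or a non-breaking space (a guess, since the real file is not shown). The helper name starts_with_whitespace and the sample lines are hypothetical.

```python
import re


def starts_with_whitespace(line):
    """True if the first character is any whitespace (space, tab, NBSP, ...)."""
    return re.match(r"\s", line) is not None


# Hypothetical lines: one title line and continuation lines indented three ways.
lines = [
    "North Italian Folk, by Alice Vansittart Strettel Carr                    56763",
    " and Randolph Caldecott",
    "\t[Subtitle: Sketches of Town and Country Life]",
    "\u00a0[Language: English]",
]

kept = [line for line in lines if not starts_with_whitespace(line)]
print(kept)  # only the title line remains
```

re.match already anchors at the start of the string, so the leading ^ is optional. The same test can also be written without a regular expression as line[:1].isspace().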

Upvotes: 1
