Tom
Tom

Reputation: 117

Python extract json strings from large textfile

I am working on a project that requires a large data set. I found a sufficiently large data set ( editions dump at https://openlibrary.org/developers/dumps, about 5GB) which is formatted

/type/edition   /books/OL10000135M  4   2010-04-24T17:54:01.503315  {"publishers": ["Bernan Press"], "physical_format": "Hardcover", "subtitle": "9th November - 3rd December, 1992", "key": "/books/OL10000135M", "title": "Parliamentary Debates, House of Lords, Bound Volumes, 1992-93", "identifiers": {"goodreads": ["6850240"]}, "isbn_13": ["9780107805401"], "languages": [{"key": "/languages/eng"}], "number_of_pages": 64, "isbn_10": ["0107805405"], "publish_date": "December 1993", "last_modified": {"type": "/type/datetime", "value": "2010-04-24T17:54:01.503315"}, "authors": [{"key": "/authors/OL2645777A"}], "latest_revision": 4, "works": [{"key": "/works/OL7925046W"}], "type": {"key": "/type/edition"}, "subjects": ["Government - Comparative", "Politics / Current Events"], "revision": 4} 
/type/edition   /books/OL10000179M  4   2010-04-24T17:54:01.503315  {"publishers": ["Stationery Office"], "physical_format": "Hardcover", "subtitle": "26 January - 4 February 1998", "title": "Parliamentary Debates, House of Lords, 1997-98", "isbn_10": ["0107805855"], "identifiers": {"goodreads": ["2862283"]}, "isbn_13": ["9780107805852"], "edition_name": "5th edition", "languages": [{"key": "/languages/eng"}], "number_of_pages": 124, "last_modified": {"type": "/type/datetime", "value": "2010-04-24T17:54:01.503315"}, "latest_revision": 4, "key": "/books/OL10000179M", "authors": [{"key": "/authors/OL2645811A"}], "publish_date": "January 1999", "works": [{"key": "/works/OL7925994W"}], "type": {"key": "/type/edition"}, "subjects": ["Bibliographies, catalogues, discographies", "POLITICS & GOVERNMENT", "Reference works", "Bibliographies & Indexes", "Reference"], "revision": 4}
 etc...

and I wan to extract the JSON part (the fifth field).

I am trying to use str.replace() (on a 50 line subset of the big file) but it is being finicky. I would think something like this would work, but it is not (nothing is changed/replaced)

 with fileinput.input(files=("testData.txt"), inplace=True, backup='.bak') as file:
    for line in file:
            print(line.replace(".*({.*})$", "\1"), end="")

I tried to parse it column by column (a regex that identifies each column) but then I ran into something that baffled me. The following code

 with fileinput.input(files=("testData.txt"), inplace=True, backup='.bak') as file:
    for line in file:
            print(line.replace("/type/edition\t/books/", "WORK PLZ"), end="")

yields

 WORK PLZOL10000135M    4   2010-04-24T17:54:01.503315  {"publishers": ["Bernan Press"], "physical_format": "Hardcover", "subtitle": "9th November - 3rd December, 1992", "key": "/books/OL10000135M", "title": "Parliamentary Debates, House of Lords, Bound Volumes, 1992-93", "identifiers": {"goodreads": ["6850240"]}, "isbn_13": ["9780107805401"], "languages": [{"key": "/languages/eng"}], "number_of_pages": 64, "isbn_10": ["0107805405"], "publish_date": "December 1993", "last_modified": {"type": "/type/datetime", "value": "2010-04-24T17:54:01.503315"}, "authors": [{"key": "/authors/OL2645777A"}], "latest_revision": 4, "works": [{"key": "/works/OL7925046W"}], "type": {"key": "/type/edition"}, "subjects": ["Government - Comparative", "Politics / Current Events"], "revision": 4}
 WORK PLZOL10000179M    4   2010-04-24T17:54:01.503315  {"publishers": ["Stationery Office"], "physical_format": "Hardcover", "subtitle": "26 January - 4 February 1998", "title": "Parliamentary Debates, House of Lords, 1997-98", "isbn_10": ["0107805855"], "identifiers": {"goodreads": ["2862283"]}, "isbn_13": ["9780107805852"], "edition_name": "5th edition", "languages": [{"key": "/languages/eng"}], "number_of_pages": 124, "last_modified": {"type": "/type/datetime", "value": "2010-04-24T17:54:01.503315"}, "latest_revision": 4, "key": "/books/OL10000179M", "authors": [{"key": "/authors/OL2645811A"}], "publish_date": "January 1999", "works": [{"key": "/works/OL7925994W"}], "type": {"key": "/type/edition"}, "subjects": ["Bibliographies, catalogues, discographies", "POLITICS & GOVERNMENT", "Reference works", "Bibliographies & Indexes", "Reference"], "revision": 4}

but

 with fileinput.input(files=("testData.txt"), inplace=True, backup='.bak') as file:
    for line in file:
            print(line.replace("/type/edition\t/books/\w+", "WORK PLZ"), end="")

does not do anything. It seems like the \w+ is not picking up the alphanumeric string after the /books/.

Am I doing something wring with my regular expressions? Is there a better way to go about this?

Upvotes: 3

Views: 3829

Answers (1)

Jean-François Fabre
Jean-François Fabre

Reputation: 140168

(as mentionned in comments) str.replace doesn't understand regular expressions. That explains why your code fails.

I would partition the string (assuming that there isn't any { char before the json string), then parse as json:

import json
with open("test.txt") as f:
    for line in f:
        json_expr = "{"+line.partition("{")[2]
        the_dict = json.loads(json_expr)

or split according to spaces but use maxsplit parameter to limit the split and take the last element (the json data). Since the json expression is the last item, it works:

json_expr = line.split(None,4)[-1]

Upvotes: 2

Related Questions