Data

Reputation: 769

Writing to JSON file, then reading this same file and getting "JSONDecodeError: Extra data"

I have a very large json file (9GB). I'm reading in one object from it at a time, and then deleting key-value pairs in this object when the key is not in the list fields.

Each object is basically someone's user profile on a job searching website, but it comes with many unwanted key-value pairs that are not relevant to my analysis. There are about 3 million of these profiles.

I'd like to write each new profile/object to a json file, cleaned.json. Essentially this should be a copy of the original json file, except any of the key-value pairs not mentioned in fields have been removed from all 3 million profiles.

To do this, I wrote the following code:

import json

# fields to keep
fields = ["skills", "industry", "summary", "education", "experience"]

with open('cleaned.json', 'w', encoding='UTF8') as f:
    for profile in open(path_to_file, encoding='UTF8'):
        profile = json.loads(profile)

        # remove unwanted fields from profile
        for key in list(profile.keys()):
            if key not in fields:
                del profile[key]

        # write profile to new json file
        json.dump(profile, f)

To test whether it worked, I tried reading the json file in again, like so:

for foo in open('cleaned.json', encoding='UTF8'):
    foo = json.loads(foo)
    print(json.dumps(foo, indent=4))

But I'm getting this error: JSONDecodeError: Extra data on the foo = json.loads(foo) line.

I've tested this by only modifying 1 profile from the original json and writing this modified profile to cleaned.json, and cleaned.json looks like this (except it's all on one line, I've just pretty printed it for this post):

{
    "skills": [
        "Key Account Development",
        "Strategic Planning",
        "Market Planning",
        "Team Leadership",
        "Negotiation",
        "Forecasting",
        "Key Account Management",
        "Sales Management",
        "New Business Development",
        "Business Planning",
        "Cross-functional Team Leadership",
        "Budgeting",
        "Strategy Development",
        "Business Strategy",
        "Consultative Selling",
        "Medical Devices",
        "Customer Relations",
        "Contract Negotiation",
        "Mentoring",
        "Coaching",
        "Healthcare",
        "Territory",
        "Sales Process",
        "Direct Sales",
        "Sales Operations",
        "Pharmaceutical Sales"
    ],
    "industry": "Medical Devices",
    "summary": "SALES MANAGEMENT / BUSINESS DEVELOPMENT / PROJECT MANAGEMENTDOMESTIC & INTERNATIONAL KEY ACCOUNT MANAGEMENTBusiness and Sales Executive with 20 years of accomplished career track, reflecting extensive experience and dynamic record-breaking performance in the Medical Industry markets. Exceptional communicator, strong team player, flexible self-starter with consultative sales style, strong negotiations skills, exceptional problem solving abilities, and accurate customer assessment aptitude. Manage and lead teams to success, drive new business through key accounts management, establish partnerships, manage solid distributor relationship for increased profitability and sales volumes. Very well organized, accurate and on-time administrative work, with a track record that demonstrates self-motivation, creativity, sales team leadership, initiative to achieve corporate, team and personal goals. Experience in the following markets: Medical Devices, Medical Disposables, Capital Equipment, Pharmaceuticals."
}{
    "education": [
        {
            "start": "2008",
            "major": "Economics",
            "end": "2008",
            "name": "Columbia University - Columbia Business School",
            "desc": "Coursework \"Principals of Economics\" ECON1105\tSpring 2008"
        },
        {
            "start": "2007",
            "end": "2007",
            "name": "Columbia University - Columbia Business School"
        },
        {
            "major": "Cancer genomics",
            "end": "2001",
            "name": "G\u00f6teborgs universitet",
            "degree": "Ph.D.",
            "start": "1996",
            "desc": "Thesis: \"The role of p53 in tumor progression and prognosis in patients with primary colorectal cancer\""
        },
        {
            "start": "1994",
            "major": "Biology, Medicine;German Language",
            "end": "1995",
            "name": "Universit\u00e4t Regensburg",
            "degree": "Cancer Research, Coursework"
        },
        {
            "major": "Biology",
            "end": "1994",
            "name": "G\u00f6teborgs universitet",
            "degree": "Master",
            "start": "1989",
            "desc": ""
        },
        {
            "start": "1992",
            "major": "50% Biology and Medicine, 50% mixed music, sports, computer science, art etc",
            "end": "1993",
            "name": "The University of Georgia",
            "desc": "Scholarship for one full year of Graduate Studies."
        }
    ],
    "skills": [
        "Molecular Biology",
        "Biomarkers"
    ],
    "industry": "Pharmaceuticals",
    "experience": [
        {
            "org": "Johnson and Johnson",
            "title": "Senior Scientist, Oncology Biomarkers",
            "end": "Present",
            "start": "November 2009",
            "desc": "Biomarker Leader for compounds in clinical development.*Developing and implementing predictive and pharmacodynamic biomarkers for the use in Phase 0 - III oncology clinical trials.."
        },
        {
            "org": "Albert Einstein Medical Center",
            "title": "Associate at Dept of Molecular Genetics",
            "start": "September 2008",
            "desc": "Single Cell Gene expression."
        },
        {
            "org": "Columbia University",
            "title": "Associate Research Scientist",
            "start": "August 2006",
            "desc": "Work on peptide to restore wt p53 function in cancer."
        },
        {
            "org": "Memorial Sloan Kettering Cancer Center",
            "title": "Post Doctoral Research Fellow",
            "start": "January 2003",
            "desc": "Molecular profiling of colorectal cancer."
        },
        {
            "org": "Sahlgrenska University Hospital",
            "title": "Research Scientist",
            "start": "November 2001",
            "desc": "Cancer Research at Dept of Surgery.Molecular profiling of Colorectal Cancer with focus on p53."
        }
    ],
    "summary": "Ph.D. scientist with background in cancer research, translational medicine and early drug development with special focus on biomarkers and personalized medicine."
}

So when I read this in, I'm getting the error. What am I doing wrong? I guess there is something wrong with the way I'm writing the profile to cleaned.json?


Sample input for testing

Sample input has 3 profiles.

{"_id": "in-00000001", "name": {"family_name": "Mazalu MBA", "given_name": "Dr Catalin"}, "locality": "United States", "skills": ["Key Account Development", "Strategic Planning", "Market Planning", "Team Leadership", "Negotiation", "Forecasting", "Key Account Management", "Sales Management", "New Business Development", "Business Planning", "Cross-functional Team Leadership", "Budgeting", "Strategy Development", "Business Strategy", "Consultative Selling", "Medical Devices", "Customer Relations", "Contract Negotiation", "Mentoring", "Coaching", "Healthcare", "Territory", "Sales Process", "Direct Sales", "Sales Operations", "Pharmaceutical Sales"], "industry": "Medical Devices", "summary": "SALES MANAGEMENT / BUSINESS DEVELOPMENT / PROJECT MANAGEMENTDOMESTIC & INTERNATIONAL KEY ACCOUNT MANAGEMENTBusiness and Sales Executive with 20 years of accomplished career track, reflecting extensive experience and dynamic record-breaking performance in the Medical Industry markets. Exceptional communicator, strong team player, flexible self-starter with consultative sales style, strong negotiations skills, exceptional problem solving abilities, and accurate customer assessment aptitude. Manage and lead teams to success, drive new business through key accounts management, establish partnerships, manage solid distributor relationship for increased profitability and sales volumes. Very well organized, accurate and on-time administrative work, with a track record that demonstrates self-motivation, creativity, sales team leadership, initiative to achieve corporate, team and personal goals. 
Experience in the following markets: Medical Devices, Medical Disposables, Capital Equipment, Pharmaceuticals.", "url": "http://www.linkedin.com/in/00000001", "also_view": [{"url": "http://www.linkedin.com/pub/krisa-drost/45/909/513", "id": "pub-krisa-drost-45-909-513"}, {"url": "http://ro.linkedin.com/pub/florin-ut/18/b33/77b", "id": "pub-florin-ut-18-b33-77b"}, {"url": "http://ro.linkedin.com/pub/cristian-radu/21/225/149", "id": "pub-cristian-radu-21-225-149"}, {"url": "http://ro.linkedin.com/pub/traian-rusu/16/652/279", "id": "pub-traian-rusu-16-652-279"}, {"url": "http://ro.linkedin.com/pub/dumitrescu-catalin/3/283/92", "id": "pub-dumitrescu-catalin-3-283-92"}, {"url": "http://www.linkedin.com/pub/jody-brelsford/9/21a/354", "id": "pub-jody-brelsford-9-21a-354"}, {"url": "http://www.linkedin.com/pub/mary-anne-dilloway/2/55a/18", "id": "pub-mary-anne-dilloway-2-55a-18"}, {"url": "http://ro.linkedin.com/pub/carmen-baleanu/2b/252/203", "id": "pub-carmen-baleanu-2b-252-203"}, {"url": "http://il.linkedin.com/in/shimonlobel", "id": "in-shimonlobel"}, {"url": "http://ro.linkedin.com/pub/monica-danilescu/19/36a/121", "id": "pub-monica-danilescu-19-36a-121"}]}
{"_id": "in-00001", "education": [{"start": "2008", "major": "Economics", "end": "2008", "name": "Columbia University - Columbia Business School", "desc": "Coursework \"Principals of Economics\" ECON1105\tSpring 2008"}, {"start": "2007", "end": "2007", "name": "Columbia University - Columbia Business School"}, {"major": "Cancer genomics", "end": "2001", "name": "G\u00f6teborgs universitet", "degree": "Ph.D.", "start": "1996", "desc": "Thesis: \"The role of p53 in tumor progression and prognosis in patients with primary colorectal cancer\""}, {"start": "1994", "major": "Biology, Medicine;German Language", "end": "1995", "name": "Universit\u00e4t Regensburg", "degree": "Cancer Research, Coursework"}, {"major": "Biology", "end": "1994", "name": "G\u00f6teborgs universitet", "degree": "Master", "start": "1989", "desc": ""}, {"start": "1992", "major": "50% Biology and Medicine, 50% mixed music, sports, computer science, art etc", "end": "1993", "name": "The University of Georgia", "desc": "Scholarship for one full year of Graduate Studies."}], "group": {"affilition": ["ASMALLWORLD.net", "Biomarker Research & Executive Network", "Biomarker Society", "Biomarkers", "Biomarkers in Discovery, Development and the Clinic Network", "Biotechnology/Pharmaceuticals", "Circulating Tumor Cell (CTC) and Cancer Stem Cell Group", "Clinical Development Job Opportunities - Europe", "Epigenetics", "Molecular Diagnostics Professional Network", "Molecular Diagnostics for Cancer Drug Development Forum", "NYC Women in Biotech", "Oncology Drug Development (Premier Group For Cancer Drug Development)", "Oncology Pharma\u2122", "Personalized Medicine", "Personalized Oncology Medicine - Global Group", "Professionals in the Pharmaceutical and Biotech Industry", "Svenskar i New York", "Translational Medicine Alliance"]}, "name": {"family_name": "Forslund", "given_name": "Ann"}, "overview_html": "<dl id=\"overview\"><dt id=\"overview-summary-current-title\" class=\"summary-current\" 
style=\"display:block\">\nCurrent\n</dt>\n<dd class=\"summary-current\" style=\"display:block\">\n<ul class=\"current\"><li>\nSenior Scientist, Oncology Biomarkers\n<span class=\"at\">at </span>\n<a class=\"company-profile-public\" href=\"/company/johnson-&amp;-johnson?trk=ppro_cprof\"><span class=\"org summary\">Johnson and Johnson</span></a>\n</li>\n</ul></dd>\n<dt id=\"overview-summary-past-title\" class=\"summary-past\" style=\"display:block\">\nPast\n</dt>\n<dd class=\"summary-past\" style=\"display:block\">\n<ul class=\"past\"><li>\nAssociate at Dept of Molecular Genetics\n<span class=\"at\">at </span>\n<a class=\"company-profile-public\" href=\"/company/einstein-medical-center-philadelphia?trk=ppro_cprof\"><span class=\"org summary\">Albert Einstein Medical Center</span></a>\n</li>\n<li>\nAssociate Research Scientist\n<span class=\"at\">at </span>\n<a class=\"company-profile-public\" href=\"/company/columbia-university?trk=ppro_cprof\"><span class=\"org summary\">Columbia University</span></a>\n</li>\n<li>\nPost Doctoral Research Fellow\n<span class=\"at\">at </span>\nMemorial Sloan Kettering Cancer Center\n</li>\n</ul><div class=\"showhide-block\" id=\"morepast\">\n<ul class=\"past\"><li>\nResearch Scientist\n<span class=\"at\">at </span>\n<a class=\"company-profile-public\" href=\"/company/sahlgrenska-university-hospital?trk=ppro_cprof\"><span class=\"org summary\">Sahlgrenska University Hospital</span></a>\n</li>\n</ul><p class=\"seeall showhide-link\"><a href=\"#\" id=\"morepast-hide\">see less</a></p>\n</div>\n<p class=\"seeall showhide-link\"><a href=\"#\" id=\"morepast-show\">see all</a></p>\n</dd>\n<dt id=\"overview-summary-education-title\" class=\"summary-education\" style=\"display:block\">\nEducation\n</dt>\n<dd class=\"summary-education\" style=\"display:block\">\n<ul><li>\nColumbia University - Columbia Business School\n</li>\n<li>\nColumbia University - Columbia Business School\n</li>\n<li>\nG\u00f6teborgs universitet\n</li>\n</ul><div 
class=\"showhide-block\" id=\"moreedu\">\n<ul><li>\n<div name=\"education\">\nUniversit\u00e4t Regensburg\n</div>\n</li>\n<li>\n<div name=\"education\">\nG\u00f6teborgs universitet\n</div>\n</li>\n<li>\n<div name=\"education\">\nThe University of Georgia\n</div>\n</li>\n</ul><p class=\"seeall showhide-link\"><a href=\"#\" id=\"moreedu-hide\">see less</a></p>\n</div>\n<p class=\"seeall showhide-link\"><a href=\"#\" id=\"moreedu-show\">see all</a></p>\n</dd>\n<dt>\nConnections\n</dt>\n<dd class=\"overview-connections\">\n<p>\n<strong>244</strong> connections\n</p>\n</dd>\n</dl>", "locality": "Antwerp Area, Belgium", "skills": ["Molecular Biology", "Biomarkers"], "industry": "Pharmaceuticals", "interval": 20, "experience": [{"org": "Johnson and Johnson", "title": "Senior Scientist, Oncology Biomarkers", "end": "Present", "start": "November 2009", "desc": "Biomarker Leader for compounds in clinical development.*Developing and implementing predictive and pharmacodynamic biomarkers for the use in Phase 0 - III oncology clinical trials.."}, {"org": "Albert Einstein Medical Center", "title": "Associate at Dept of Molecular Genetics", "start": "September 2008", "desc": "Single Cell Gene expression."}, {"org": "Columbia University", "title": "Associate Research Scientist", "start": "August 2006", "desc": "Work on peptide to restore wt p53 function in cancer."}, {"org": "Memorial Sloan Kettering Cancer Center", "title": "Post Doctoral Research Fellow", "start": "January 2003", "desc": "Molecular profiling of colorectal cancer."}, {"org": "Sahlgrenska University Hospital", "title": "Research Scientist", "start": "November 2001", "desc": "Cancer Research at Dept of Surgery.Molecular profiling of Colorectal Cancer with focus on p53."}], "summary": "Ph.D. 
scientist with background in cancer research, translational medicine and early drug development with special focus on biomarkers and personalized medicine.", "url": "http://be.linkedin.com/in/00001", "also_view": [{"url": "http://www.linkedin.com/pub/peter-king/4/993/a16", "id": "pub-peter-king-4-993-a16"}, {"url": "http://www.linkedin.com/pub/hans-winkler/1/1ab/78a", "id": "pub-hans-winkler-1-1ab-78a"}, {"url": "http://de.linkedin.com/pub/michael-koslowski/26/964/99b", "id": "pub-michael-koslowski-26-964-99b"}, {"url": "http://de.linkedin.com/pub/werner-seiz/b/14/436", "id": "pub-werner-seiz-b-14-436"}, {"url": "http://de.linkedin.com/pub/miro-venturi/7/725/217", "id": "pub-miro-venturi-7-725-217"}, {"url": "http://ch.linkedin.com/pub/lisa-d-amato/3/808/267", "id": "pub-lisa-d-amato-3-808-267"}, {"url": "http://www.linkedin.com/pub/june-kaplow-ph-d/2/382/924", "id": "pub-june-kaplow-ph-d-2-382-924"}, {"url": "http://fr.linkedin.com/pub/fabien-schmidlin/b/b73/4b2", "id": "pub-fabien-schmidlin-b-b73-4b2"}, {"url": "http://be.linkedin.com/pub/tine-casneuf/2/563/884", "id": "pub-tine-casneuf-2-563-884"}, {"url": "http://be.linkedin.com/pub/jeroen-aerssens/0/b9a/6ba", "id": "pub-jeroen-aerssens-0-b9a-6ba"}], "specilities": "Biomarkers in Oncology, Cancer Genomics, Molecular Profiling of Cancer, Translational Cancer Research, Early Development Drug Discovery", "events": [{"from": "Sahlgrenska University Hospital", "to": "Memorial Sloan Kettering Cancer Center", "title1": "Research Scientist", "start": 24022, "title2": "Post Doctoral Research Fellow", "end": 24036}, {"from": "Memorial Sloan Kettering Cancer Center", "to": "Columbia University", "title1": "Post Doctoral Research Fellow", "start": 24036, "title2": "Associate Research Scientist", "end": 24079}, {"from": "Columbia University", "to": "Albert Einstein Medical Center", "title1": "Associate Research Scientist", "start": 24079, "title2": "Associate at Dept of Molecular Genetics", "end": 24104}, {"from": "Albert 
Einstein Medical Center", "to": "Johnson and Johnson", "title1": "Associate at Dept of Molecular Genetics", "start": 24104, "title2": "Senior Scientist, Oncology Biomarkers", "end": 24118}]}
{"_id": "in-00006", "interests": "personal genomics, nanotechnology", "education": [{"major": "Biophysics", "end": "2009", "name": "Harvard University", "degree": "Ph.D", "start": "2004", "desc": ""}, {"major": "Computer Science", "end": "2003", "name": "Yale University", "degree": "B.S.", "start": "1999", "desc": ""}], "name": {"family_name": "Douglas", "given_name": "Shawn"}, "overview_html": "<dl id=\"overview\"><dt id=\"overview-summary-current-title\" class=\"summary-current\" style=\"display:block\">\nCurrent\n</dt>\n<dd class=\"summary-current\" style=\"display:block\">\n<ul class=\"current\"><li>\nAssistant Professor\n<span class=\"at\">at </span>\nUCSF\n</li>\n</ul></dd>\n<dt id=\"overview-summary-past-title\" class=\"summary-past\" style=\"display:block\">\nPast\n</dt>\n<dd class=\"summary-past\" style=\"display:block\">\n<ul class=\"past\"><li>\nTechnology Development Fellow\n<span class=\"at\">at </span>\n<a class=\"company-profile-public\" href=\"/company/wyss-institute-for-biologically-inspired-engineering?trk=ppro_cprof\"><span class=\"org summary\">Wyss Institute for Biologically Inspired Engineering</span></a>\n</li>\n</ul></dd>\n<dt id=\"overview-summary-education-title\" class=\"summary-education\" style=\"display:block\">\nEducation\n</dt>\n<dd class=\"summary-education\" style=\"display:block\">\n<ul><li>\nHarvard University\n</li>\n<li>\nYale University\n</li>\n</ul></dd>\n<dt>\nConnections\n</dt>\n<dd class=\"overview-connections\">\n<p>\n<strong>164</strong> connections\n</p>\n</dd>\n<dt class=\"websites\">Websites</dt>\n<dd class=\"websites\">\n<ul><li>\n<a href=\"/redir/redirect?url=http%3A%2F%2Fbionano%2Eucsf%2Eedu%2F&amp;urlhash=JefI\" target=\"_blank\" title=\"New window will open\" name=\"overviewsite\">\nCompany Website\n</a>\n</li>\n<li>\n<a href=\"/redir/redirect?url=http%3A%2F%2Fwww%2Eshawndouglas%2Ecom%2F&amp;urlhash=Loa8\" target=\"_blank\" title=\"New window will open\" name=\"overviewsite\">\nPersonal 
Website\n</a>\n</li>\n<li>\n<a href=\"/redir/redirect?url=http%3A%2F%2Fbiomod%2Enet%2F&amp;urlhash=vQXo\" target=\"_blank\" title=\"New window will open\" name=\"overviewsite\">\nBIOMOD\n</a>\n</li>\n</ul></dd>\n</dl>", "locality": "San Francisco, California", "skills": ["DNA", "Nanotechnology", "Molecular Biology", "Software Development"], "industry": "Research", "interval": 0, "experience": [{"org": "UCSF", "title": "Assistant Professor", "end": "Present", "start": "September 2012"}, {"org": "Wyss Institute for Biologically Inspired Engineering", "title": "Technology Development Fellow", "start": "May 2009"}], "summary": "I am interested in inventing new methods to construct and manipulate biological molecules at the nanometer scale, toward developing new scientific tools and therapeutic devices.", "url": "http://www.linkedin.com/in/00006", "also_view": [{"url": "http://www.linkedin.com/pub/george-church/1/630/2b8", "id": "pub-george-church-1-630-2b8"}, {"url": "http://www.linkedin.com/pub/andrew-hessel/4/4b0/290", "id": "pub-andrew-hessel-4-4b0-290"}, {"url": "http://www.linkedin.com/pub/ayis-antoniou/0/216/630", "id": "pub-ayis-antoniou-0-216-630"}, {"url": "http://uk.linkedin.com/pub/matthew-bellis/35/973/888", "id": "pub-matthew-bellis-35-973-888"}, {"url": "http://www.linkedin.com/pub/john-mulligan-ph-d/7/5a3/5aa", "id": "pub-john-mulligan-ph-d-7-5a3-5aa"}, {"url": "http://www.linkedin.com/pub/yang-mao/38/621/a83", "id": "pub-yang-mao-38-621-a83"}, {"url": "http://www.linkedin.com/pub/sidney-wang/25/3b8/b84", "id": "pub-sidney-wang-25-3b8-b84"}, {"url": "http://www.linkedin.com/pub/yang-mao/9/815/369", "id": "pub-yang-mao-9-815-369"}, {"url": "http://www.linkedin.com/pub/j-markson/32/572/10", "id": "pub-j-markson-32-572-10"}], "homepage": {"BIOMOD": ["http://biomod.net/"], "Company Website": ["http://bionano.ucsf.edu/"], "Personal Website": ["http://www.shawndouglas.com/"]}, "events": [{"from": "Wyss Institute for Biologically Inspired Engineering", "to": 
"UCSF", "title1": "Technology Development Fellow", "start": 24112, "title2": "Assistant Professor", "end": 24152}]}

Upvotes: 0

Views: 180

Answers (2)

martineau

Reputation: 123453

Here's code that seems to work with your sample input. As I said in a comment, the file you are dealing with is in JSON Lines format (one complete JSON object per line) rather than standard JSON format.

Since you appear to want the cleaned version in that same format (in other words, not converted to standard JSON format, as I thought at one point), here's how to do that:

import json

path_to_file = "sample_input.json"
cleaned_file = "cleaned.json"

# Fields to keep.
fields = ["skills", "industry", "summary", "education", "experience"]

# Clean profiles in JSON Lines format file.
with open(path_to_file, encoding='UTF8') as inf, \
     open(cleaned_file, 'w', encoding='UTF8') as outf:

    for line in inf:
        profile = json.loads(line)  # Read a profile object.
        for key in list(profile.keys()):  # Remove unwanted fields from it.
            if key not in fields:
                del profile[key]
        outf.write(json.dumps(profile) + '\n') # Write cleaned profile to new file

# Test whether it worked.
with open(cleaned_file, encoding='UTF8') as cleaned:
    for line in cleaned:
        profile = json.loads(line)
        print(json.dumps(profile, indent=4))
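
The per-profile filtering step can also be written as a dict comprehension. This is only a sketch of the same idea, assuming it is acceptable to build a new dict for each profile instead of deleting keys in place; the names KEEP and clean_profile are illustrative and do not appear in the question.

```python
import json

# Illustrative names: KEEP and clean_profile are not part of the original code.
KEEP = {"skills", "industry", "summary", "education", "experience"}

def clean_profile(line):
    """Parse one JSON Lines record and return it as a JSON string with only the KEEP fields."""
    profile = json.loads(line)
    kept = {key: value for key, value in profile.items() if key in KEEP}
    return json.dumps(kept)

sample = '{"_id": "in-00006", "industry": "Research", "interval": 0}'
print(clean_profile(sample))
```

Using a set for the fields to keep makes each membership test a hash lookup, which may matter slightly across 3 million profiles.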

Upvotes: 1

Christian Sauer

Reputation: 10889

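You are basically dumping a new JSON object into the file every time you call json.dump(profile, f). That does not produce valid JSON, because the objects are simply written back to back instead of being embedded in an enclosing structure: you get {}{} rather than, for example, [{},{}]. Since nothing separates them, they also all end up on a single line, so json.loads sees a second object after the first one and raises "Extra data".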
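
A small reproduction may make this concrete. It is a sketch that uses in-memory buffers (io.StringIO) in place of cleaned.json and two made-up one-field profiles; it shows that objects dumped back to back fail to parse, while one object per line reads back cleanly.

```python
import io
import json

# Two tiny stand-in profiles (illustrative data, not the real records).
profiles = [{"industry": "Medical Devices"}, {"industry": "Pharmaceuticals"}]

# What the question's loop produces: objects written back to back on one line.
broken = io.StringIO()
for profile in profiles:
    json.dump(profile, broken)

try:
    json.loads(broken.getvalue())
    error = None
except json.JSONDecodeError as exc:
    error = exc.msg  # short message without the position details

# Writing a newline after each object keeps one object per line.
fixed = io.StringIO()
for profile in profiles:
    fixed.write(json.dumps(profile) + "\n")

round_trip = [json.loads(line) for line in fixed.getvalue().splitlines()]
print(error)
print(round_trip == profiles)
```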

As for a solution: with a file this size, holding everything in memory while reading or writing is a bad idea. I would probably try a streaming library such as https://pypi.org/project/jsonstreams/ or something similar.
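
If a file of back-to-back objects has already been written, the standard library can still split it apart. The sketch below uses json.JSONDecoder.raw_decode, not the jsonstreams library mentioned above, and the helper name iter_concatenated is made up for illustration. It works on a string that is already in memory, so a multi-gigabyte file would need to be fed to it in chunks or simply regenerated with one object per line.

```python
import json

def iter_concatenated(text):
    """Yield each JSON value from a string of back-to-back values such as '{}{}'."""
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        # raw_decode expects a value to start exactly at pos, so skip whitespace by hand.
        if text[pos].isspace():
            pos += 1
            continue
        # raw_decode returns the parsed value and the index where it ended.
        obj, pos = decoder.raw_decode(text, pos)
        yield obj

objects = list(iter_concatenated(
    '{"industry": "Medical Devices"}{"industry": "Pharmaceuticals"}\n'
))
print(objects)
```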

Upvotes: 0
