Reputation: 769
Apologies if this is very simple or has already been asked. I am new to Python and to working with JSON files, so I'm quite confused.
I have a 9 GB json file scraped from a website. This data consists of information about some 3 million individuals. Each individual has attributes, but not all individuals have the same attributes. An attribute corresponds to a key in the json file, like so:
{
    "_id": "in-00000001",
    "name": {
        "family_name": "Trump",
        "given_name": "Donald"
    },
    "locality": "United States",
    "skills": [
        "Twitter",
        "Real Estate",
        "Golf"
    ],
    "industry": "Government",
    "experience": [
        {
            "org": "Republican",
            "end": "Present",
            "start": "January 2017",
            "title": "President of the United States"
        },
        {
            "org": "The Apprentice",
            "end": "2015",
            "start": "2003",
            "title": "The guy that fires people"
        }
    ]
}
So here, _id, name, locality, skills, industry and experience are attributes (keys). Another profile may have additional attributes, like education, awards or interests, or lack some attribute found in another profile, like the skills attribute, and so on.
What I'd like to do is scan through each profile in the json file, and if a profile contains the attributes skills, industry and experience, I'd like to extract that information and insert it into a data frame (I suppose I need Pandas for this?). From experience, I would want to specifically extract the name of their current employer, i.e. the most recent listing under org. The data frame would look like this:
Industry | Current employer | Skills
___________________________________________________________________
Government | Republican | Twitter, Real Estate, Golf
Marketing | Marketers R Us | Branding, Social Media, Advertising
... and so on for all profiles with these three attributes.
I'm struggling to find a good resource that explains how to do this kind of thing, hence my question.
I suppose rough pseudocode would be:
for each profile in open(path to .json file):
    if profile has keys "experience", "industry" AND "skills":
        on the same row of the data frame:
            insert current employer into "current employer" column of data frame
            insert industry into "industry" column of data frame
            insert list of skills into "skills" column of data frame
I just need to know how to write this in Python.
Upvotes: 3
Views: 2361
Reputation: 4866
I assume the file contains all the profiles, such as
{
    "profile 1" : {
        # Full object as in the example above
    },
    "profile 2" : {
        # Full object as in the example above
    }
}
Before continuing, let me show a cleaner way to lay the data out in a Pandas DataFrame.
A DataFrame cell can technically hold a list, but list-valued columns are awkward to filter, group and aggregate. So we will instead repeat the row once per skill, as shown in the example below. Check this question and JD Long's answer for more detail: how to use lists as values in pandas dataframe?
ID | Industry | Current employer | Skill
___________________________________________________________________
in-01 | Government | Republican | Twitter
in-01 | Government | Republican | Real Estate
in-01 | Government | Republican | Golf
in-02 | Marketing | Marketers R Us | Branding
in-02 | Marketing | Marketers R Us | Social Media
in-02 | Marketing | Marketers R Us | Advertising
Find explanations in the comments in the code below:
import json
import pandas as pd

# Placeholder: replace with the path to the .json file
path = "path/to/file.json"

# Load the file as json. json.load() reads and parses it into a dict
# in one step (note that this holds the whole file in memory).
with open(path) as file:
    obj = json.load(file)

rows = []
# Then iterate its items() as key/value pairs.
# The line below depends on my first assumption about the file layout;
# depending on the actual file format, it might have to differ.
for prof_key, profile in obj.items():
    # Verify that the profile contains all the required keys
    if all(key in profile for key in ("_id", "experience", "industry", "skills")):
        # Current employer: the experience entry whose "end" is "Present"
        current = [x for x in profile["experience"] if x.get("end") == "Present"]
        if not current:
            continue  # no current employer listed, skip this profile
        # One row per skill
        for skill in profile["skills"]:
            rows.append([profile["_id"],
                         profile["industry"],
                         current[0]["org"],
                         skill])

# Create the DataFrame df once, with the columns as in the example
df = pd.DataFrame(rows, columns=['ID', 'Industry', 'Employer', 'Skill'])

Each rows.append(...) call adds one row, and the dataframe is built once at the end. Assigning to df.loc[-1] inside the loop would not append rows: it writes to the single row labelled -1 and overwrites it on every iteration, so only the last row would survive.
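Reading a 9 GB file in one go requires the whole document to fit in memory. If the file turns out to hold one profile per line (the JSON Lines layout), which is an assumption about the data and not something the question confirms, the profiles can be processed one at a time instead. A minimal sketch under that assumption; the function name iter_rows and the two-line sample are hypothetical:

```python
import io
import json

REQUIRED = ("_id", "experience", "industry", "skills")

def iter_rows(lines):
    """Yield one (id, industry, employer, skill) tuple per skill.

    Assumes every non-empty line is one complete JSON profile.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        profile = json.loads(line)
        # Skip profiles that lack any of the required keys
        if not all(key in profile for key in REQUIRED):
            continue
        # Current employer: the experience entry whose "end" is "Present"
        current = [x for x in profile["experience"] if x.get("end") == "Present"]
        if not current:
            continue
        for skill in profile["skills"]:
            yield (profile["_id"], profile["industry"], current[0]["org"], skill)

# Hypothetical two-line sample standing in for the real file
sample = io.StringIO(
    '{"_id": "in-01", "industry": "Government", "skills": ["Twitter", "Golf"], '
    '"experience": [{"org": "Republican", "end": "Present"}]}\n'
    '{"_id": "in-02", "industry": "Marketing"}\n'
)
rows = list(iter_rows(sample))
print(rows)
```

With a real file, an open file object can be passed in place of the sample, and the resulting tuples can be handed to pd.DataFrame with the four column names used above.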
When you later want to use this information per profile, you will have to group the rows with df.groupby('ID').
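As an illustration of that grouping step, the sketch below collapses the one-row-per-skill layout back into one row per profile with a comma-separated skills string. The small dataframe is made-up sample data in the same shape as the table above, and per_profile is a hypothetical name:

```python
import pandas as pd

# Sample data in the one-row-per-skill layout
df = pd.DataFrame(
    [
        ["in-01", "Government", "Republican", "Twitter"],
        ["in-01", "Government", "Republican", "Real Estate"],
        ["in-01", "Government", "Republican", "Golf"],
        ["in-02", "Marketing", "Marketers R Us", "Branding"],
    ],
    columns=["ID", "Industry", "Employer", "Skill"],
)

# Group the rows of each profile and join its skills into one string
per_profile = (
    df.groupby(["ID", "Industry", "Employer"])["Skill"]
      .agg(", ".join)
      .reset_index()
)
print(per_profile)
```

Grouping on all three identifying columns keeps Industry and Employer in the result; grouping on ID alone would work too if only the skills are needed.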
Let me know if your file(s) use a different format, and whether this explanation is enough to get you started or you need more.
Upvotes: 2