Reputation: 1170
I have a pandas dataframe containing some business data, and a JSON object that contains (among other things) a way to map one identifier to another (more granular to less granular). Is there any python or pandas function to accomplish this without converting the JSON to a dataframe first?
e.g.
input data frame
Bob, 123, Jakes Lane
mapping json
mapping = {'street number': 123, 'street name': 'Jakes Lane', 'postalcode': 'A1B2C3'}
What I want as the output is
Bob, 123, Jakes Lane, A1B2C3
Upvotes: 0
Views: 52
Reputation: 18782
This demo code shows how to iterate over all rows of a dataframe and collect values (zip codes) from a JSON object. In the final step, the collected values are used to create a new column in the dataframe.
import pandas as pd
from io import StringIO

data2 = """ID, Name, Number, Street
1, Bob, 123, Jakes Lane
2, Cat, 415, Carrot Alley
3, Derek, 741, Dirt Lane"""

# Proposed JSON object data, keyed by street number
better_json = {
    123: {'street number': 123, 'street name': 'Jakes Lane', 'postalcode': 'A1B2C3'},
    741: {'street number': 741, 'street name': 'Dirt Lane', 'postalcode': 'A2Z4C6'},
    531: {'street number': 531, 'street name': 'Fair Lane', 'postalcode': 'A0Z1C9'}
}

df2 = pd.read_csv(StringIO(data2), sep=r',\s+', index_col='ID', engine='python')
print(df2)

# Loop over the Number column and collect the matching zip codes
link_values = []
for i, item in df2.Number.items():  # .items() replaces the deprecated .iteritems()
    try:
        link_values.append(better_json[item]['postalcode'])
    except KeyError:
        link_values.append("")

df2["zipcode"] = link_values
print()
print(df2)
The output printed by the code:
    Name  Number        Street
ID
1    Bob     123    Jakes Lane
2    Cat     415  Carrot Alley
3  Derek     741     Dirt Lane

    Name  Number        Street zipcode
ID
1    Bob     123    Jakes Lane  A1B2C3
2    Cat     415  Carrot Alley
3  Derek     741     Dirt Lane  A2Z4C6
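As an aside, the same lookup can be done without an explicit loop by flattening the nested dict and using Series.map. This is a sketch assuming the same df2 and better_json shapes as above:

```python
import pandas as pd

# Same shape of data as above
df2 = pd.DataFrame({'Name': ['Bob', 'Cat', 'Derek'],
                    'Number': [123, 415, 741]})
better_json = {
    123: {'street number': 123, 'street name': 'Jakes Lane', 'postalcode': 'A1B2C3'},
    741: {'street number': 741, 'street name': 'Dirt Lane', 'postalcode': 'A2Z4C6'},
    531: {'street number': 531, 'street name': 'Fair Lane', 'postalcode': 'A0Z1C9'}
}

# Flatten the nested dict to Number -> postalcode, then map it onto the column;
# numbers with no entry become NaN, which fillna replaces with ""
pc_by_number = {k: v['postalcode'] for k, v in better_json.items()}
df2['zipcode'] = df2['Number'].map(pc_by_number).fillna('')
print(df2)
```

Series.map avoids the try/except entirely, since missing keys simply produce NaN.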
Upvotes: 1
Reputation: 1247
Per your comment, here is fixed code to do the join using 2 methods.
First, we need to define the problem data. There are 2 datasets: a dataframe of people and a JSON address list.
The goal is to join the 2 datasets to generate a list of name, street number, street name, postalcode - one comma-separated (CSV) line per person.
Below are 2 methods to do the join. IMHO, doing it with a temp dataframe would be much better. Unsure of the size of each dataset; assume they are LARGE. Thus METHOD 2 seems to be the better approach, as it does 1 pass thru the JSON and 1 pass thru the dataframe.
Here is the data, parsed with json.loads():
import json
import numpy as np
import pandas as pd

df = pd.DataFrame([{"name": "Bob", "street number": 123, "street name": "Jakes Lane"}])

# Define JSON content (as it would appear in a file)
jstr = '{"addrList" : [{"street number":123, "street name":"Jakes Lane", "postalcode" : "A1B2C3"}]}'

# Parse the JSON string into a dict
jDict = json.loads(jstr)
METHOD 1: for every address in the JSON, use apply to add the postal code to the rows that match
df['postalcode'] = np.nan
sn = 'street number'
snm = 'street name'

for addr in jDict['addrList']:
    pc = addr['postalcode']
    # Fill in the postal code on rows whose street number and name match this address
    df['postalcode'] = df.apply(
        lambda r: pc if (r[sn] == addr[sn]) and (r[snm] == addr[snm]) else r['postalcode'],
        axis=1)

infoList = list(df.apply(lambda r: '%s, %s, %s, %s' % (r['name'], r[sn], r[snm], r['postalcode']), axis=1))
infoList
METHOD 2: transform the JSON into a dict indexed by street number + street name, then search the dataframe for matches
d = {str(a[sn]) + a[snm]: a['postalcode'] for a in jDict['addrList']}

def matchAddr(r):
    a = str(r[sn]) + r[snm]
    pc = d.get(a, np.nan)  # np.nan when there is no matching address
    return '%s, %s, %s, %s' % (r['name'], r[sn], r[snm], pc)

list(df.apply(matchAddr, axis=1))
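For completeness, the temp-dataframe approach mentioned above could look like this. It is a sketch using pd.json_normalize on the same jstr and df as defined earlier, then a left join on the two shared address columns:

```python
import json
import pandas as pd

df = pd.DataFrame([{"name": "Bob", "street number": 123, "street name": "Jakes Lane"}])
jstr = '{"addrList" : [{"street number":123, "street name":"Jakes Lane", "postalcode" : "A1B2C3"}]}'

# Build a temporary dataframe from the JSON address list and left-join on the shared keys
addr_df = pd.json_normalize(json.loads(jstr)['addrList'])
merged = df.merge(addr_df, on=['street number', 'street name'], how='left')
print(merged)
```

Rows with no matching address simply end up with NaN in the postalcode column, and pandas handles the matching in one call.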
Upvotes: 1