Reputation: 755
I have a dataframe where parameters
column is JSON and contains multiple actual rows and columns:
input_data = pandas.DataFrame({'id':['0001','0002','0003'],
'parameters':["{'product':['book','cat','fish'],'person':['me','you']}",
"'{'product':['book','cat'],'person':['me','you','us']}'",
"'{'product':['apple','snake','rabbit','octopus'],'person':['them','you','us','we','they']}'"]})
... from which I'd like to extract the following data frames:
product_data = pandas.DataFrame({'id':['0001','0001','0001','0002','0002','0003','0003','0003','0003'],
'product':['book','cat','fish','book','cat','apple','snake','rabbit','octopus']})
person_data = pandas.DataFrame({'id':['0001','0001','0002','0002','0002','0003','0003','0003','0003','0003'],
'person':['me','you','me','you','us','them','you','us','we','they']})
Below is how I've utilized Regular Expressions to get me there. I doubt this is the best way to do it but here it goes:
for i in input_data.id.tolist():
s = ''.join(input_data[input_data.id == i]['parameters'])
product_string = re.search(r"product':(.*?),'person", str(s)).group(1)
product_data = pandas.DataFrame(product_string[1:-1].split(','))
person_string = re.search(r"person':(.*?)}", str(s)).group(1)
person_data = pandas.DataFrame(person_string[1:-1].split(','))
print("........")
print(product_data)
print("........")
print(person_data)
I'd like to learn a faster, more elegant, or wholesome solution that may capture unexpected nuances.
Upvotes: 1
Views: 1944
Reputation: 3752
Given the weird structure of the strings in row 2 and 3 and the final output desired below is one version:
input_data = pd.DataFrame({'id':['0001','0002','0003'],
'parameters':["{'product':['book','cat','fish'],'person':['me','you']}",
"'{'product':['book','cat'],'person':['me','you','us']}'",
"'{'product':['apple','snake','rabbit','octopus'],'person':['them','you','us','we','they']}'"]})
input_data['parameters'] = input_data['parameters'].str.replace("'{", '{').str.replace("'{", '{').str.replace("}'", '}')
input_data = input_data.join(pd.DataFrame(input_data['parameters'].apply(literal_eval).values.tolist()))
products_len = input_data['product'].apply(len).values
persons_len = input_data['person'].apply(len).values
df
## flatten x into a list of dictionaries
values = input_data['person'].values.flatten().tolist()
flat_results = [item for sublist in values for item in sublist]
## reinsert a and b
person_df = pd.DataFrame(flat_results, columns = ['person'])
## flatten x into a list of dictionaries
values = input_data['product'].values.flatten().tolist()
flat_results = [item for sublist in values for item in sublist]
## reinsert a and b
product_df = pd.DataFrame(flat_results, columns = ['product'])
## person
ids = input_data['id'].repeat(persons_len).reset_index(drop=True)
person_df = person_df.join(ids)
## product
ids = input_data['id'].repeat(products_len).reset_index(drop=True)
product_df = product_df.join(ids)
person_df
Out[57]:
person id
0 me 0001
1 you 0001
2 me 0002
3 you 0002
4 us 0002
5 them 0003
6 you 0003
7 us 0003
8 we 0003
9 they 0003
product_df
Out[58]:
product id
0 book 0001
1 cat 0001
2 fish 0001
3 book 0002
4 cat 0002
5 apple 0003
6 snake 0003
7 rabbit 0003
8 octopus 0003
Upvotes: 0
Reputation: 59274
First, setup yor products and persons using str.get
accessor
input_data['products'] = input_data.parameters.str.get('product')
Now, for pandas >= 0.25.0
, you may use the explode
method
input_data.explode('products')
for pandas <= 0.25.0
, you may refer to this thread
I assumed you have dictionaries in your data frames, and not strings as you exposed here.
If you have strings, you may always
import ast
input_data.parameters.apply(ast.literal_eval)
to make them real dictionaries.
Upvotes: 2