Reputation: 53

Pandas parse csv column from dict into table

My csv file:

FILE_INFO,                                                                                                                                        CATEGORY,    AREA,              BOX,                     NAME
"{'id': 1, 'width': 4032, 'height': 3024, 'file_name': 'pic1.jpeg', 'license': 0, 'flickr_url': '', 'coco_url': '', 'date_captured': 0}",PRODUCT,2247.8981,"[2283.54, 934.13, 27.37, 82.13]","{'subcategory': 'BOTTLE', 'occluded': False}"
"{'id': 2, 'width': 4032, 'height': 3024, 'file_name': 'pic2.jpeg', 'license': 0, 'flickr_url': '', 'coco_url': '', 'date_captured': 0}",PRODUCT,2450.7795,"[2239.91, 1284.21, 33.15, 73.93]","{'subcategory': 'BOTTLE', 'occluded': False}"
"{'id': 3, 'width': 4032, 'height': 3024, 'file_name': 'pic3.jpeg', 'license': 0, 'flickr_url': '', 'coco_url': '', 'date_captured': 0}",INDUSTRIAL litter,2548.956,"[2316.07, 301.5, 68.3, 37.32]","{'subcategory': 'BOTTLE', 'occluded': False}"
"{'id': 4, 'width': 4032, 'height': 3024, 'file_name': 'pic4.jpeg', 'license': 0, 'flickr_url': '', 'coco_url': '', 'date_captured': 0}",INDUSTRIAL litter,1465.0172,"[3394.37, 1083.97, 26.99, 54.28]","{'subcategory': 'PAPER', 'occluded': False}"

How can I parse FILE_INFO column and get just file_name table without any other information. Same with NAME column and get only subcategory from it. Others tables are good.

Upvotes: 0

Answers (2)

Rob Raymond

Reputation: 31166

The main workaround is that the JSON column is not well formed to get into it with json.loads()

replace single quotes with double quotes to make it well formed
json.loads() to convert from string to dict
apply(pd.Series) to expand dict to columns

Now you have simple to navigate dataframe.

df = pd.read_csv(io.StringIO('''FILE_INFO,                                                                                                                                        CATEGORY,    AREA,              BOX,                     NAME
"{'id': 1, 'width': 4032, 'height': 3024, 'file_name': 'pic1.jpeg', 'license': 0, 'flickr_url': '', 'coco_url': '', 'date_captured': 0}",PRODUCT,2247.8981,"[2283.54, 934.13, 27.37, 82.13]","{'subcategory': 'BOTTLE', 'occluded': False}"
"{'id': 2, 'width': 4032, 'height': 3024, 'file_name': 'pic2.jpeg', 'license': 0, 'flickr_url': '', 'coco_url': '', 'date_captured': 0}",PRODUCT,2450.7795,"[2239.91, 1284.21, 33.15, 73.93]","{'subcategory': 'BOTTLE', 'occluded': False}"
"{'id': 3, 'width': 4032, 'height': 3024, 'file_name': 'pic3.jpeg', 'license': 0, 'flickr_url': '', 'coco_url': '', 'date_captured': 0}",INDUSTRIAL litter,2548.956,"[2316.07, 301.5, 68.3, 37.32]","{'subcategory': 'BOTTLE', 'occluded': False}"
"{'id': 4, 'width': 4032, 'height': 3024, 'file_name': 'pic4.jpeg', 'license': 0, 'flickr_url': '', 'coco_url': '', 'date_captured': 0}",INDUSTRIAL litter,1465.0172,"[3394.37, 1083.97, 26.99, 54.28]","{'subcategory': 'PAPER', 'occluded': False}"'''))

df = df.drop(columns="FILE_INFO").join(df.FILE_INFO.apply(lambda x: json.loads(x.replace("'","\""))).apply(pd.Series))

	CATEGORY	AREA	BOX	NAME	id	width	height	file_name
0	PRODUCT	2247.9	[2283.54, 934.13, 27.37, 82.13]	{'subcategory': 'BOTTLE', 'occluded': False}	1	4032	3024	pic1.jpeg
1	PRODUCT	2450.78	[2239.91, 1284.21, 33.15, 73.93]	{'subcategory': 'BOTTLE', 'occluded': False}	2	4032	3024	pic2.jpeg
2	INDUSTRIAL litter	2548.96	[2316.07, 301.5, 68.3, 37.32]	{'subcategory': 'BOTTLE', 'occluded': False}	3	4032	3024	pic3.jpeg
3	INDUSTRIAL litter	1465.02	[3394.37, 1083.97, 26.99, 54.28]	{'subcategory': 'PAPER', 'occluded': False}	4	4032	3024	pic4.jpeg

Upvotes: 0

Nidal Barada

Reputation: 393

You can iterate over the values in a for loop and use JSON to extract the data you need.

So in a for loop you would do something like this:

import json

for row in rows:
    json.loads(row.replace("\'", "\""))['file_name']

Upvotes: 1

Pandas parse csv column from dict into table

Answers (2)

Related Questions