alexuuu
alexuuu

Reputation: 53

Pandas parse csv column from dict into table

My csv file:

FILE_INFO,                                                                                                                                        CATEGORY,    AREA,              BOX,                     NAME
"{'id': 1, 'width': 4032, 'height': 3024, 'file_name': 'pic1.jpeg', 'license': 0, 'flickr_url': '', 'coco_url': '', 'date_captured': 0}",PRODUCT,2247.8981,"[2283.54, 934.13, 27.37, 82.13]","{'subcategory': 'BOTTLE', 'occluded': False}"
"{'id': 2, 'width': 4032, 'height': 3024, 'file_name': 'pic2.jpeg', 'license': 0, 'flickr_url': '', 'coco_url': '', 'date_captured': 0}",PRODUCT,2450.7795,"[2239.91, 1284.21, 33.15, 73.93]","{'subcategory': 'BOTTLE', 'occluded': False}"
"{'id': 3, 'width': 4032, 'height': 3024, 'file_name': 'pic3.jpeg', 'license': 0, 'flickr_url': '', 'coco_url': '', 'date_captured': 0}",INDUSTRIAL litter,2548.956,"[2316.07, 301.5, 68.3, 37.32]","{'subcategory': 'BOTTLE', 'occluded': False}"
"{'id': 4, 'width': 4032, 'height': 3024, 'file_name': 'pic4.jpeg', 'license': 0, 'flickr_url': '', 'coco_url': '', 'date_captured': 0}",INDUSTRIAL litter,1465.0172,"[3394.37, 1083.97, 26.99, 54.28]","{'subcategory': 'PAPER', 'occluded': False}"

How can I parse FILE_INFO column and get just file_name table without any other information. Same with NAME column and get only subcategory from it. Others tables are good.

Upvotes: 0

Views: 60

Answers (2)

Rob Raymond
Rob Raymond

Reputation: 31166

The main workaround is that the JSON column is not well formed to get into it with json.loads()

  • replace single quotes with double quotes to make it well formed
  • json.loads() to convert from string to dict
  • apply(pd.Series) to expand dict to columns

Now you have simple to navigate dataframe.

df = pd.read_csv(io.StringIO('''FILE_INFO,                                                                                                                                        CATEGORY,    AREA,              BOX,                     NAME
"{'id': 1, 'width': 4032, 'height': 3024, 'file_name': 'pic1.jpeg', 'license': 0, 'flickr_url': '', 'coco_url': '', 'date_captured': 0}",PRODUCT,2247.8981,"[2283.54, 934.13, 27.37, 82.13]","{'subcategory': 'BOTTLE', 'occluded': False}"
"{'id': 2, 'width': 4032, 'height': 3024, 'file_name': 'pic2.jpeg', 'license': 0, 'flickr_url': '', 'coco_url': '', 'date_captured': 0}",PRODUCT,2450.7795,"[2239.91, 1284.21, 33.15, 73.93]","{'subcategory': 'BOTTLE', 'occluded': False}"
"{'id': 3, 'width': 4032, 'height': 3024, 'file_name': 'pic3.jpeg', 'license': 0, 'flickr_url': '', 'coco_url': '', 'date_captured': 0}",INDUSTRIAL litter,2548.956,"[2316.07, 301.5, 68.3, 37.32]","{'subcategory': 'BOTTLE', 'occluded': False}"
"{'id': 4, 'width': 4032, 'height': 3024, 'file_name': 'pic4.jpeg', 'license': 0, 'flickr_url': '', 'coco_url': '', 'date_captured': 0}",INDUSTRIAL litter,1465.0172,"[3394.37, 1083.97, 26.99, 54.28]","{'subcategory': 'PAPER', 'occluded': False}"'''))

df = df.drop(columns="FILE_INFO").join(df.FILE_INFO.apply(lambda x: json.loads(x.replace("'","\""))).apply(pd.Series))


CATEGORY AREA BOX NAME id width height file_name license flickr_url coco_url date_captured
0 PRODUCT 2247.9 [2283.54, 934.13, 27.37, 82.13] {'subcategory': 'BOTTLE', 'occluded': False} 1 4032 3024 pic1.jpeg 0 0
1 PRODUCT 2450.78 [2239.91, 1284.21, 33.15, 73.93] {'subcategory': 'BOTTLE', 'occluded': False} 2 4032 3024 pic2.jpeg 0 0
2 INDUSTRIAL litter 2548.96 [2316.07, 301.5, 68.3, 37.32] {'subcategory': 'BOTTLE', 'occluded': False} 3 4032 3024 pic3.jpeg 0 0
3 INDUSTRIAL litter 1465.02 [3394.37, 1083.97, 26.99, 54.28] {'subcategory': 'PAPER', 'occluded': False} 4 4032 3024 pic4.jpeg 0 0

Upvotes: 0

Nidal Barada
Nidal Barada

Reputation: 393

You can iterate over the values in a for loop and use JSON to extract the data you need.

So in a for loop you would do something like this:

import json

for row in rows:
    json.loads(row.replace("\'", "\""))['file_name']

Upvotes: 1

Related Questions