Reputation: 53
My csv file:
FILE_INFO, CATEGORY, AREA, BOX, NAME
"{'id': 1, 'width': 4032, 'height': 3024, 'file_name': 'pic1.jpeg', 'license': 0, 'flickr_url': '', 'coco_url': '', 'date_captured': 0}",PRODUCT,2247.8981,"[2283.54, 934.13, 27.37, 82.13]","{'subcategory': 'BOTTLE', 'occluded': False}"
"{'id': 2, 'width': 4032, 'height': 3024, 'file_name': 'pic2.jpeg', 'license': 0, 'flickr_url': '', 'coco_url': '', 'date_captured': 0}",PRODUCT,2450.7795,"[2239.91, 1284.21, 33.15, 73.93]","{'subcategory': 'BOTTLE', 'occluded': False}"
"{'id': 3, 'width': 4032, 'height': 3024, 'file_name': 'pic3.jpeg', 'license': 0, 'flickr_url': '', 'coco_url': '', 'date_captured': 0}",INDUSTRIAL litter,2548.956,"[2316.07, 301.5, 68.3, 37.32]","{'subcategory': 'BOTTLE', 'occluded': False}"
"{'id': 4, 'width': 4032, 'height': 3024, 'file_name': 'pic4.jpeg', 'license': 0, 'flickr_url': '', 'coco_url': '', 'date_captured': 0}",INDUSTRIAL litter,1465.0172,"[3394.37, 1083.97, 26.99, 54.28]","{'subcategory': 'PAPER', 'occluded': False}"
How can I parse FILE_INFO column and get just file_name table without any other information. Same with NAME column and get only subcategory from it. Others tables are good.
Upvotes: 0
Views: 60
Reputation: 31166
The main workaround is that the JSON column is not well formed to get into it with json.loads()
json.loads()
to convert from string to dict
apply(pd.Series)
to expand dict to columnsNow you have simple to navigate dataframe.
df = pd.read_csv(io.StringIO('''FILE_INFO, CATEGORY, AREA, BOX, NAME
"{'id': 1, 'width': 4032, 'height': 3024, 'file_name': 'pic1.jpeg', 'license': 0, 'flickr_url': '', 'coco_url': '', 'date_captured': 0}",PRODUCT,2247.8981,"[2283.54, 934.13, 27.37, 82.13]","{'subcategory': 'BOTTLE', 'occluded': False}"
"{'id': 2, 'width': 4032, 'height': 3024, 'file_name': 'pic2.jpeg', 'license': 0, 'flickr_url': '', 'coco_url': '', 'date_captured': 0}",PRODUCT,2450.7795,"[2239.91, 1284.21, 33.15, 73.93]","{'subcategory': 'BOTTLE', 'occluded': False}"
"{'id': 3, 'width': 4032, 'height': 3024, 'file_name': 'pic3.jpeg', 'license': 0, 'flickr_url': '', 'coco_url': '', 'date_captured': 0}",INDUSTRIAL litter,2548.956,"[2316.07, 301.5, 68.3, 37.32]","{'subcategory': 'BOTTLE', 'occluded': False}"
"{'id': 4, 'width': 4032, 'height': 3024, 'file_name': 'pic4.jpeg', 'license': 0, 'flickr_url': '', 'coco_url': '', 'date_captured': 0}",INDUSTRIAL litter,1465.0172,"[3394.37, 1083.97, 26.99, 54.28]","{'subcategory': 'PAPER', 'occluded': False}"'''))
df = df.drop(columns="FILE_INFO").join(df.FILE_INFO.apply(lambda x: json.loads(x.replace("'","\""))).apply(pd.Series))
CATEGORY | AREA | BOX | NAME | id | width | height | file_name | license | flickr_url | coco_url | date_captured | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | PRODUCT | 2247.9 | [2283.54, 934.13, 27.37, 82.13] | {'subcategory': 'BOTTLE', 'occluded': False} | 1 | 4032 | 3024 | pic1.jpeg | 0 | 0 | ||
1 | PRODUCT | 2450.78 | [2239.91, 1284.21, 33.15, 73.93] | {'subcategory': 'BOTTLE', 'occluded': False} | 2 | 4032 | 3024 | pic2.jpeg | 0 | 0 | ||
2 | INDUSTRIAL litter | 2548.96 | [2316.07, 301.5, 68.3, 37.32] | {'subcategory': 'BOTTLE', 'occluded': False} | 3 | 4032 | 3024 | pic3.jpeg | 0 | 0 | ||
3 | INDUSTRIAL litter | 1465.02 | [3394.37, 1083.97, 26.99, 54.28] | {'subcategory': 'PAPER', 'occluded': False} | 4 | 4032 | 3024 | pic4.jpeg | 0 | 0 |
Upvotes: 0
Reputation: 393
You can iterate over the values in a for loop and use JSON to extract the data you need.
So in a for loop you would do something like this:
import json
for row in rows:
json.loads(row.replace("\'", "\""))['file_name']
Upvotes: 1