Reputation: 187
The pandas DataFrame is created by reading a CSV file with 100+ columns and 2+ million rows, which looks similar to:
project_code,dict_data
A001,{"a": "1", "b": "2", "c": "3"}
B001,{"a": "4", "b": "5", "d": "6"}
C001,{"b": "7", "c": "8"}
D001,{"d": "9"}
E001,{"a": "10", "c": "11"}
Using a for loop and if conditions, the dict column's key-value data is parsed, and the df then looks as follows. At times new keys get added (like "d" below), and I have to update my if conditions to parse them:
project_code  dict_data                        a     b    c     d
A001          {"a": "1", "b": "2", "c": "3"}   "1"   "2"  "3"   ""
B001          {"a": "4", "b": "5", "d": "6"}   "4"   "5"  ""    "6"
C001          {"b": "7", "c": "8"}             ""    "7"  "8"   ""
D001          {"d": "9"}                       ""    ""   ""    "9"
E001          {"a": "10", "c": "11"}           "10"  ""   "11"  ""
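For reference, the loop-and-ifs pattern I described can be sketched like this (a hypothetical reconstruction with toy data, not my actual code):

```python
import ast

import pandas as pd

df = pd.DataFrame({
    "project_code": ["A001", "B001", "C001", "D001", "E001"],
    "dict_data": [
        '{"a": "1", "b": "2", "c": "3"}',
        '{"a": "4", "b": "5", "d": "6"}',
        '{"b": "7", "c": "8"}',
        '{"d": "9"}',
        '{"a": "10", "c": "11"}',
    ],
})

# Pre-create the output columns.
for col in ["a", "b", "c", "d"]:
    df[col] = ""

# Row-by-row parsing with one if branch per known key; every brand-new
# key in the data means adding another branch here.
for i, raw in df["dict_data"].items():
    d = ast.literal_eval(raw)
    if "a" in d:
        df.loc[i, "a"] = d["a"]
    if "b" in d:
        df.loc[i, "b"] = d["b"]
    if "c" in d:
        df.loc[i, "c"] = d["c"]
    if "d" in d:
        df.loc[i, "d"] = d["d"]
```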
The code takes hours to run and is clearly not an efficient way to process the dict column. I would appreciate guidance on a better, more efficient approach.
Upvotes: 0
Views: 55
Reputation: 20450
Two million rows is an immediate tip-off.
Use a relational database already! Postgres would be a good choice, but even the very simple SQLite that ships with Python's built-in sqlite3 module would be a big improvement on what you're doing.
The magic of an indexed DB table is that you can often access just the subset of data you need, while leaving most rows untouched on disk. This is quite different from being forced to load all two million rows before you can even begin to work with your data.
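A minimal sketch of that indexed-lookup idea with the standard-library sqlite3 module (the table and column names here are made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # use a file path for real data
conn.execute("CREATE TABLE projects (project_code TEXT, a TEXT, b TEXT, c TEXT)")
conn.executemany(
    "INSERT INTO projects VALUES (?, ?, ?, ?)",
    [("A001", "1", "2", "3"), ("B001", "4", "5", ""), ("C001", "", "7", "8")],
)

# The index lets SQLite jump straight to matching rows
# instead of scanning the whole table.
conn.execute("CREATE INDEX idx_project_code ON projects (project_code)")

row = conn.execute(
    "SELECT a, b, c FROM projects WHERE project_code = ?", ("B001",)
).fetchone()
```

With an index on project_code, that SELECT touches only the matching rows, no matter how many millions are in the table.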
Pandas has good support for writing a DataFrame to a table and for reading table rows into a new DataFrame.
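A minimal sketch of that round trip with the built-in sqlite3 driver (the "projects" table name and toy data are just for illustration):

```python
import sqlite3

import pandas as pd

df = pd.DataFrame({
    "project_code": ["A001", "B001", "C001"],
    "a": ["1", "4", ""],
})

conn = sqlite3.connect(":memory:")  # use a file path for real data

# Write the whole DataFrame out as a table once...
df.to_sql("projects", conn, index=False, if_exists="replace")

# ...then read back only the rows you actually need.
subset = pd.read_sql(
    "SELECT * FROM projects WHERE project_code = 'B001'", conn
)
```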
Upvotes: 0