Reputation: 1445
For input, I have a dictionary
{
"TAX:10672": "[
{\"entity_id\":10672,\"profile_id\":20321,\"metric_type_name\":\"CAPEX\",\"metric\":null,\"perform_metric\":null},
{\"entity_id\":10672,\"profile_id\":32583,\"metric_type_name\":\"CAPEX\",\"metric\":null,\"perform_metric\":null},
{\"entity_id\":10672,\"profile_id\":8526,\"metric_type_name\":\"CAPEX\",\"metric\":null,\"perform_metric\":null}
]",
"TAX:10869": "[
{\"entity_id\":10869,\"profile_id\":20430,\"metric_type_name\":\"OPEX\",\"metric\":null,\"perform_metric\":null,},
{\"entity_id\":10869,\"profile_id\":32692,\"metric_type_name\":\"CAPEX\",\"metric\":null,\"perform_metric\":null},
{\"entity_id\":10869,\"profile_id\":8631,\"metric_type_name\":\"Revenue\",\"metric\":null,\"perform_metric\":null}
]"
}
In the code below, I take the dictionary values, parse each one into a DataFrame, and then concatenate them into a single DataFrame with columns "entity_id", "profile_id", "metric_type_name", etc.
import pandas as pd

input_dict = ...  # sample values given above
temp = list()
# parse each JSON string into its own DataFrame (appending as a side effect of a set comprehension)
{temp.append(pd.read_json(v)) for v in list(input_dict.values())}
output_df = pd.concat(temp)
However, the performance is very poor: for a dictionary of 5000 entries it takes approximately 100-120 seconds.
I want to know if there is a way to improve the performance of this code.
Upvotes: 0
Views: 535
Reputation: 2918
You can use a fast third-party library (orjson, ujson) to parse the JSON first, then feed the resulting dicts into pandas. An example using orjson:
import orjson
import pandas as pd
from itertools import chain

# parse every value with orjson (much faster than pd.read_json), then flatten
# the per-key lists of dicts into one stream and build a single DataFrame
parsed = map(orjson.loads, input_dict.values())
output_df = pd.DataFrame(chain.from_iterable(parsed))
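On the sample above, this should produce a six-row DataFrame with columns entity_id, profile_id, metric_type_name, metric and perform_metric (once the trailing comma in the "TAX:10869" value is removed).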
Note that your input_dict values must each be a valid JSON array; orjson is strict, so the trailing comma in your "TAX:10869" value above would raise an error as-is.
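If you can't fix the strings at the source, here is a minimal sketch of a lenient parse; loads_lenient is a hypothetical helper and assumes a stray comma before a closing ] or } is the only malformation:

import re
import orjson

def loads_lenient(s):
    # try a strict parse first; on failure, strip any comma that directly
    # precedes a closing bracket or brace and retry once
    try:
        return orjson.loads(s)
    except orjson.JSONDecodeError:
        return orjson.loads(re.sub(r",\s*([\]}])", r"\1", s))

parsed = map(loads_lenient, input_dict.values())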
Upvotes: 1