Abhishek Patil

Reputation: 1445

Improve performance of converting list of JSON string to Dataframe

For input, I have a dictionary

{
"TAX:10672": "[
    {\"entity_id\":10672,\"profile_id\":20321,\"metric_type_name\":\"CAPEX\",\"metric\":null,\"perform_metric\":null},
    {\"entity_id\":10672,\"profile_id\":32583,\"metric_type_name\":\"CAPEX\",\"metric\":null,\"perform_metric\":null},
    {\"entity_id\":10672,\"profile_id\":8526,\"metric_type_name\":\"CAPEX\",\"metric\":null,\"perform_metric\":null}
]",    
"TAX:10869": "[
    {\"entity_id\":10869,\"profile_id\":20430,\"metric_type_name\":\"OPEX\",\"metric\":null,\"perform_metric\":null,},
    {\"entity_id\":10869,\"profile_id\":32692,\"metric_type_name\":\"CAPEX\",\"metric\":null,\"perform_metric\":null},
    {\"entity_id\":10869,\"profile_id\":8631,\"metric_type_name\":\"Revenue\",\"metric\":null,\"perform_metric\":null}
]"

}

In the code given below, I take the dictionary values, convert each one to a DataFrame, and concatenate them into a single DataFrame with columns "entity_id", "profile_id", "metric_type_name", etc.

import pandas as pd

input_dict = {...}  # sample values given above
temp = [pd.read_json(v) for v in input_dict.values()]
output_df = pd.concat(temp)

However, the performance is very poor: for a dictionary of 5000 entries it takes approximately 100-120 seconds.

I want to know if there is a way to improve the performance of this code.

Upvotes: 0

Views: 535

Answers (1)

Alexander Volkovsky

Reputation: 2918

You can use a fast third-party library such as orjson or ujson to parse the JSON first, then feed the resulting lists of dicts into pandas. An example using orjson:

import orjson
import pandas as pd
from itertools import chain

# Parse each JSON string into a list of dicts, then flatten into one DataFrame.
parsed = map(orjson.loads, input_dict.values())
output_df = pd.DataFrame(chain.from_iterable(parsed))

Note that each of your input_dict values must be a valid JSON array (no trailing commas, etc.) -- the "TAX:10869" sample above contains a trailing comma.
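If some values do contain trailing commas, strict parsers raise an error rather than accept them. A minimal sketch of a retry-with-cleanup wrapper (the `load_lenient` name and the regex cleanup are my own illustration, not part of orjson; the regex is naive and could mangle commas inside string values, so treat it as a sketch only):

```python
import re

try:
    import orjson as fast_json  # third-party, as suggested above
except ImportError:
    import json as fast_json  # stdlib fallback; also rejects trailing commas

def load_lenient(s: str):
    """Parse a JSON array, retrying once with trailing commas stripped."""
    try:
        return fast_json.loads(s)
    except ValueError:
        # Naively drop commas that directly precede a closing brace/bracket.
        cleaned = re.sub(r",\s*([}\]])", r"\1", s)
        return fast_json.loads(cleaned)
```

With this in place you can swap `load_lenient` in for `orjson.loads` in the `map` call above.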

Upvotes: 1
