ashtonjesse
ashtonjesse

Reputation: 29

Split / explode a column of dictionaries into separate columns with polars

I am working with a 3.2Gb csv and have loaded this into a polars dataframe for processing. One of the columns contains python dictionaries in string format like this (see "_source"):

shape: (5, 9)
┌─────────────┬───────┬─────────────┬──────────┬───┬─────────────┬───────┬────────────┬────────────┐
│ _index      ┆ _type ┆ _id         ┆ _version ┆ … ┆ _primary_te ┆ found ┆ _source    ┆ _ignored   │
│ ---         ┆ ---   ┆ ---         ┆ ---      ┆   ┆ rm          ┆ ---   ┆ ---        ┆ ---        │
│ str         ┆ str   ┆ str         ┆ i64      ┆   ┆ ---         ┆ bool  ┆ str        ┆ str        │
│             ┆       ┆             ┆          ┆   ┆ i64         ┆       ┆            ┆            │
╞═════════════╪═══════╪═════════════╪══════════╪═══╪═════════════╪═══════╪════════════╪════════════╡
│ index-strin ┆ _doc  ┆ e5EL4GUC2TF ┆ 8        ┆ … ┆ 11          ┆ true  ┆ {'update_d ┆ null       │
│ g-goes-here ┆       ┆ BdJZvL4eiTk ┆          ┆   ┆             ┆       ┆ t': '2023- ┆            │
│ -something  ┆       ┆             ┆          ┆   ┆             ┆       ┆ 10-26T07:5 ┆            │
│             ┆       ┆             ┆          ┆   ┆             ┆       ┆ 4:…        ┆            │

The dictionaries in the column have exactly the same set of keys and all the values have the same format between rows. I'd like to expand the contents of this column out so that there is a new column for each of the keys in the dictionary. I've checked the documentation and asked ChatGPT but can't find any answers.

In pandas, I've done this on a smaller dataset with something like:

df['_source'] = df['source'].apply(ast.literal_eval)
pd.concat([df.drop(['_source'], axis=1), df['_source'].apply(pd.Series)], axis=1)

But this takes forever in pandas so I want to use polars. duckdb is another option but I feel like sql would get very complicated quickly. Open to suggestions though.

For the first step of converting the string to a dict, I tried this in polars:

df = pl.read_csv(csv_file)
df = df.with_columns(pl.col("_source").apply(ast.literal_eval))

But I get this error:

ComputeError: ComputeError: got mixed dtypes while constructing List Series

From the error it seems that it can't handle what is returned by ast.literal_eval. As I said above, the keys are all the same and the values have the same format so I'm not sure why it is throwing this error. Is there any other way to do this I might be missing? Why can pandas handle this command but polars cannot?

Also not sure what the command would be to convert the dict into new columns.

Update I realised it would be possible to get the data in the form of a ndjson file with properly formatted json in the _source column. Trying read_json crashed my server and using scan_ndjson resulted in the following Compute error: ComputeError: expected list/array in json value, got str

I'm sure I could get to the bottom of this but instead of spending more time on it I've gone back to loading the csv into an eager dataframe and then just using string filtering to extract the data I need like so: deleted = df.filter(pl.col('_source').str.contains("'state': 'deleted'"))

Upvotes: 2

Views: 799

Answers (2)

Hans
Hans

Reputation: 520

Have you thought about joining the strings and use literal_eval to do them all at once as a list? Maybe like this:

import ast
import pandas as pd
dummydata=['{"update_d":"xxxxx"}','{"update_d":"xxxxx22"}']
df=pd.DataFrame(dummydata)
tmpdata="[" + ','.join(df[0].__array__()) +']'
df.loc[:,0] = ast.literal_eval(tmpdata)

Upvotes: 0

ignoring_gravity
ignoring_gravity

Reputation: 10531

You're probably looking for .str.json_decode, e.g.:

In [36]: df = pl.DataFrame({'_source': ['{"update_time": "2023-01-01", "value": 3}']*2})

In [37]: df
Out[37]:
shape: (2, 1)
┌───────────────────────────────────┐
│ _source                           │
│ ---                               │
│ str                               │
╞═══════════════════════════════════╡
│ {"update_time": "2023-01-01", "v… │
│ {"update_time": "2023-01-01", "v… │
└───────────────────────────────────┘

In [38]: df.select(pl.col('_source').str.json_decode()).unnest('_source')
Out[38]:
shape: (2, 2)
┌─────────────┬───────┐
│ update_time ┆ value │
│ ---         ┆ ---   │
│ str         ┆ i64   │
╞═════════════╪═══════╡
│ 2023-01-01  ┆ 3     │
│ 2023-01-01  ┆ 3     │
└─────────────┴───────┘

Upvotes: 3

Related Questions