Penguin

Reputation: 2411

Convert a dataframe with string representations of lists to a dataframe with lists

My pandas dataframe somehow got messed up. There are 2 columns in it that were supposed to contain lists, but now they contain strings of lists:

id.       array

72        [ 2.2545414  -0.8302277  -9.557333    1.944972...
73        [ 3.0519443   1.2425094  -1.7121094   0.394222...
74        [ 2.9175313   1.0301533  -1.0083416   1.545938...
77        [-8.521629    3.2176793   2.5869853   1.399137...

id.       names_arrays

72        ['T恤', '外套', '夹克', '衬衣', '领带', '衬衫', '围巾', '粉色...
73        ['济科', '外画', '段萍', '泰舍', '萎缩性', '祝丹妮', '大京', '...
74        ['秀场', '时装周', '时装秀', '舞台', '红毯', '时装设计', '复古风'...

You can't see it on the dataframe itself, but when I print:

np.array(df['array'][:1])[0]

I get

'[ 2.2545414  -0.8302277  -9.557333    1.9449722   3.7186048   5.790459\n  0.07255215  1.3358237  -2.9177604   4.03371    -1.4177471  -1.2400303\n  2.5485678   1.0194561   0.14744097 -1.0286134   2.1207867  -1.6046501\n  3.640595   11.30236     0.98157316 -4.8968134  -0.80825585 -2.9547403\n  8.363517   -0.7563907   0.590438    0.14872111  0.28678164 -4.1656523\n  0.21350707  2.7396295  -0.86256826 -3.0678177  -2.2119153  -3.3205476\n  1.7437696  -3.5955458  -3.811455   -2.4635699   2.3464768   3.774634\n]'

And the other column:

np.array(df['names_arrays'][:1])[0]
>>> "['T恤', '外套', '夹克', '衬衣', '领带', '衬衫', '围巾', '粉色', '纽扣', '球鞋']"

I found that ast.literal_eval works for a single value of the names_arrays column:

literal_eval(np.array(df['names_arrays'][:1])[0])
>>> ['T恤', '外套', '夹克', '衬衣', '领带', '衬衫', '围巾', '粉色', '纽扣', '球鞋']

But (1) I'm not sure how to do this for the entire dataframe rather than a single row, and (2) it doesn't work for the array column, because the numbers aren't separated by commas and there are sometimes \n characters between them.

Upvotes: 0

Views: 141

Answers (1)

RJ Adriaansen

Reputation: 9619

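You can use applymap with a custom function (in pandas 2.1 and later, applymap is deprecated in favor of DataFrame.map, which is called the same way):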

import pandas as pd

data = [('[ 2.2545414  -0.8302277  -9.557333    1.944972]', "['T恤', '外套', '夹克', '衬衣', '领带', '衬衫', '围巾', '粉色']"), ('[ 3.0519443   1.2425094  -1.7121094   0.394222]', "['济科', '外画', '段萍', '泰舍', '萎缩性', '祝丹妮', '大京']")]
df = pd.DataFrame(data, columns=['array', 'names_arrays'])

def fix_lists(text):
    # Strip brackets, commas and quotes, then split on any whitespace (including "\n")
    return text.replace('[', '').replace(']', '').replace(',', ' ').replace("'", '').split()

df = df.applymap(fix_lists)

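df['array'][0][0] will return '2.2545414' and df['names_arrays'][0][0] will return 'T恤'. Note that split() produces strings, so the numbers in the array column are still strings at this point and need to be converted with float() if you want numeric values.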
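
If you need real floats in the array column, and names that may contain spaces or commas, one option is to parse each column with its own function instead of sharing one. The following is a minimal sketch, not a definitive implementation: parse_floats is a hypothetical helper name, the sample data is shortened from the question, and it assumes every value in the array column has the whitespace-separated "[ a  b  c ]" form shown above.

```python
import ast

import pandas as pd

# Shortened sample data; the "\n" mimics the line breaks seen in the question.
data = [
    ("[ 2.2545414  -0.8302277\n  -9.557333    1.944972]", "['T恤', '外套', '夹克']"),
    ("[ 3.0519443   1.2425094\n  -1.7121094   0.394222]", "['济科', '外画', '段萍']"),
]
df = pd.DataFrame(data, columns=["array", "names_arrays"])


def parse_floats(text):
    """Hypothetical helper: turn '[ 1.0  2.0\\n 3.0 ]' into a list of floats."""
    # Remove the surrounding brackets; split() with no argument splits on
    # any run of whitespace, so embedded newlines are handled as well.
    return [float(token) for token in text.strip().strip("[]").split()]


# Numeric column: whitespace-separated numbers, so literal_eval cannot parse it.
df["array"] = df["array"].apply(parse_floats)

# String column: already a valid Python list literal, so literal_eval works.
df["names_arrays"] = df["names_arrays"].apply(ast.literal_eval)

print(type(df["array"][0][0]))
print(df["names_arrays"][0])
```

Series.apply runs the function on every row of the column, which covers the "entire dataframe rather than a single row" part of the question.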

Upvotes: 1
