xsrg45

Reputation: 47

Data Preprocessing in Python using Pandas

I am trying to preprocess one of the columns in my DataFrame. The issue is that the relations column holds values like [[content1], [content2], [content3]], and I want to remove the brackets.

I have tried the following:

df['value'] = df['value'].str[0]

The output that I get is [content1].

print(df)

id     value                 
1      [[str1],[str2],[str3]]        
2      [[str4],[str5]]       
3      [[str1]]        
4      [[str8]]       
5      [[str9]]      
6      [[str4]]

The expected output is:

id     value                 
1      str1,str2,str3        
2      str4,str5       
3      str1        
4      str8       
5      str9      
6      str4

Upvotes: 1

Views: 520

Answers (3)

Dragan Kojić

Reputation: 66

You can use re, Python's built-in regular expression module. Here is one solution.

    import pandas as pd
    import re

Make the test data:

    data = [
        [1, '[[str1],[str2],[str3]]'], 
        [2, '[[str4],[str5]]'], 
        [3, '[[str1]]'], 
        [4, '[[str8]]'], 
        [5, '[[str9]]'], 
        [6, '[[str4]]']
    ]

Convert the data to a DataFrame:

    df = pd.DataFrame(data, columns = ['id', 'value'])
    print(df)


Remove '[' and ']' from the 'value' column:

    # Raw string avoids invalid-escape warnings; apply on the column instead of row-wise
    df['value'] = df['value'].apply(lambda s: re.sub(r"[\[\]]", "", s))
    print(df)
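
The same character class can also be passed to pandas' own string method, which avoids the Python-level lambda. This is a minimal sketch, assuming the column really holds strings as in the test data above; the three-row frame is a shortened copy of that data.

```python
import pandas as pd

# Shortened copy of the test data; values are plain strings.
df = pd.DataFrame({
    'id': [1, 2, 3],
    'value': ['[[str1],[str2],[str3]]', '[[str4],[str5]]', '[[str1]]'],
})

# regex=True makes pandas treat the pattern as a regular expression;
# the character class [\[\]] matches either bracket.
df['value'] = df['value'].str.replace(r"[\[\]]", "", regex=True)
print(df)
```

Passing regex explicitly matters because the default of that parameter differs between pandas versions.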


Upvotes: 0

Karn Kumar

Reputation: 8816

As far as I can see, your data looks like this sample:

Sample Data:

df = pd.DataFrame({'id':[1,2,3,4,5,6], 'value':['[[str1],[str2],[str3]]', '[[str4],[str5]]', '[[str1]]',  '[[str8]]', '[[str9]]', '[[str4]]']})
print(df)
   id                   value
0   1  [[str1],[str2],[str3]]
1   2         [[str4],[str5]]
2   3                [[str1]]
3   4                [[str8]]
4   5                [[str9]]
5   6                [[str4]]

Result:

df['value'] = df['value'].astype(str).str.replace('[', '', regex=False).str.replace(']', '', regex=False)
print(df)
   id           value
0   1  str1,str2,str3
1   2       str4,str5
2   3            str1
3   4            str8
4   5            str9
5   6            str4

Note: the error "AttributeError: Can only use .str accessor with string values" means the column is not being treated as str. You can cast it with astype(str) first and then do the replace operation.
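
Before choosing between a string replace and a list-based approach, it can help to check what the cells actually contain. The sketch below is only an illustration: the two small frames and the helper name element_types are made up for this example. Both columns may be reported with the same dtype by df.dtypes, so the per-element type is the more telling check.

```python
import pandas as pd

# Hypothetical frames: one stores text, the other real nested lists.
as_text = pd.DataFrame({'id': [1, 2], 'value': ['[[str1],[str2]]', '[[str4]]']})
as_lists = pd.DataFrame({'id': [1, 2], 'value': [[['str1'], ['str2']], [['str4']]]})

def element_types(series):
    """Return the set of Python type names found in the column (illustrative helper)."""
    return set(series.map(lambda v: type(v).__name__))

print(element_types(as_text['value']))
print(element_types(as_lists['value']))
```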

Upvotes: 0

mozway

Reputation: 260975

It looks like you have lists of lists. You can try to unnest and join:

df['value'] = df['value'].apply(lambda x: ','.join([e for l in x for e in l]))

Or:

from itertools import chain
df['value'] = df['value'].apply(lambda x: ','.join(chain.from_iterable(x)))

NB. If you get an error, please provide it and the type of the column (df.dtypes)
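
For reference, here is a self-contained run of the chain-based version. It is a sketch that assumes each cell is a list of lists of strings; the sample frame is invented to mirror the first rows of the question.

```python
from itertools import chain

import pandas as pd

# Assumed shape: every cell is a list of single-item lists of strings.
df = pd.DataFrame({
    'id': [1, 2, 3],
    'value': [[['str1'], ['str2'], ['str3']], [['str4'], ['str5']], [['str1']]],
})

# Flatten one level of nesting, then join the pieces with commas.
df['value'] = df['value'].apply(lambda nested: ','.join(chain.from_iterable(nested)))
print(df)
```

If the inner items are not strings, ','.join will raise a TypeError, so they would need converting with str first.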

Upvotes: 1
