M Arroyo
M Arroyo

Reputation: 445

Applying function to dataframe column

I have a pandas dataframe:

  name    sample
1  a      Category 1: qwe, asd (line break) Category 2: sdf, erg
2  b      Category 2: sdf, erg(line break) Category 5: zxc, eru
...
30  p      Category 1: asd, Category PE: 2134, EFDgh, Pdr tke, err 

I want to end with:

 name    qwe   asd   sdf   erg   zxc   eru 2134  EFDgh  Pdr tke  err
1  a       1     1     1     1    0     0    0     0       0       0
2  b       0     0     1     1    1     1    0     0       0       0
...
30  p      0    1      0     0    0     0    0     1       1       0

I created the following function:

def cleanattributes(istring):

    istring=str(istring)
    istring=istring.rstrip().split('\\n')

    counter=0
    for line in istring:
        istring[counter]=istring[counter].rpartition(': ')[-1]
        counter+=1
    istring=str(istring)
    istring = istring.replace("'", "")
    istring = istring.replace("\"", "")
    return(str(istring))

This function creates the expected result of returning the category information without the category titles(the idea being to use getdummies to get the columns)

teststring="Category 1: qwe, asd\\nCategory 2: sdf, erg"
cleanattributes(teststring)
OUTPUT: '[qwe, asd, sdf, erg]'

I'm not sure how to best apply this function to each record so that the dataframe looks like this:

  name    sample
1  a      qwe, asd, sdf, erg
2  b      sdf, erg, zxc, eru
...
30  p      asd, 2134, EFDgh, Pdr tke, err 

Or if this is even the best method of approaching this.

As requested:

df['sample'].iat[0]
OUTPUt= 'Category 1: qwe, asd\nCategory 2: sdf, erg'

Upvotes: 1

Views: 135

Answers (1)

Alexander
Alexander

Reputation: 109726

df = pd.DataFrame(
    {'name': ['a', 'b'],
     'sample': ['Category 1: asd, Category PE: 2134, EFDgh, Pdr tke, err', 
                'Category 2: sdf, erg\nCategory 5: zxc, eru\nCategory 1: asd, Category PE: 2134, EFDgh, Pdr tke, err']}

df2 = pd.concat([df.name, 
                 df['sample']
                 .str.replace("(Category .*: )+", '')  # Remove "Category [*]:"
                 .str.replace(r'\n', '')  # Remove "\n"
                 .str.split(', ', expand=True)], 
                axis=1)

df3 = pd.melt(df2, id_vars='name')[['name', 'value']]

>>> pd.concat([df3['name'], pd.get_dummies(df3['value'])], axis=1)
   name  2134  EFDgh  Pdr tke  ergzxc  err  eru2134  sdf
0     a     1      0        0       0    0        0    0
1     b     0      0        0       0    0        0    1
2     a     0      1        0       0    0        0    0
3     b     0      0        0       1    0        0    0
4     a     0      0        1       0    0        0    0
5     b     0      0        0       0    0        1    0
6     a     0      0        0       0    1        0    0
7     b     0      1        0       0    0        0    0
8     a     0      0        0       0    0        0    0
9     b     0      0        1       0    0        0    0
10    a     0      0        0       0    0        0    0
11    b     0      0        0       0    1        0    0

Upvotes: 2

Related Questions