HRDSL

Reputation: 761

Creating a pandas DataFrame column whose content is a set

I'm having a pandas issue.

This is my DataFrame:

user    page_number   page_parts_of_speech
Anne    1             [('Hi', NP), ('my', PP), ('name', NN), ('is', VB), ('Anne', NP)]
John    2             [('Hi', NP), ('my', PP), ('name', NN), ('is', VB), ('John', NP)]

I want to add a new column, set_of_parts_of_speech, containing a set of all the words in the page_parts_of_speech column that are paired with an NP tag.

A sample output would be:

    user    page_number   page_parts_of_speech                                                set_of_parts_of_speech
    Anne    1             [('Hi', NP), ('my', PP), ('name', NN), ('is', VB), ('Anne', NP)]    {'Hi', 'Anne'}
    John    2             [('Hi', NP), ('my', PP), ('name', NN), ('is', VB), ('John', NP)]    {'Hi', 'John'}

It is really important that the set_of_parts_of_speech column contains an actual set.

Any help on this issue will be highly appreciated.

Upvotes: 1

Views: 40

Answers (1)

jezrael

Reputation: 862511

Use apply with a comprehension that keeps only the words tagged NP and collects them into a set. First check that each cell really holds a list:

print(type(df.loc[0, 'page_parts_of_speech']))
<class 'list'>

# Keep the word (first item) of every pair whose tag (second item) is 'NP'
f = lambda x: {word for word, tag in x if tag == 'NP'}
df['set_of_parts_of_speech'] = df['page_parts_of_speech'].apply(f)
print(df)
   user  page_number                               page_parts_of_speech  \
0  Anne            1  [(Hi, NP), (my, PP), (name, NN), (is, VB), (An...   
1  John            2  [(Hi, NP), (my, PP), (name, NN), (is, VB), (Jo...   

  set_of_parts_of_speech  
0             {Hi, Anne}  
1             {Hi, John}  
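
For anyone who wants to try this end to end, here is a minimal self-contained sketch. It assumes the tags are stored as plain strings such as 'NP' (the question shows them unquoted, so that is a guess). The helper name np_words is only illustrative. The sample data is rebuilt by hand from the question.

```python
import pandas as pd

# Sample data rebuilt from the question; tags are ASSUMED to be strings.
df = pd.DataFrame({
    'user': ['Anne', 'John'],
    'page_number': [1, 2],
    'page_parts_of_speech': [
        [('Hi', 'NP'), ('my', 'PP'), ('name', 'NN'), ('is', 'VB'), ('Anne', 'NP')],
        [('Hi', 'NP'), ('my', 'PP'), ('name', 'NN'), ('is', 'VB'), ('John', 'NP')],
    ],
})


def np_words(pairs):
    """Return the set of words whose tag is 'NP' (hypothetical helper)."""
    return {word for word, tag in pairs if tag == 'NP'}


# apply calls np_words once per row; each resulting cell is a Python set.
df['set_of_parts_of_speech'] = df['page_parts_of_speech'].apply(np_words)

first = df.loc[0, 'set_of_parts_of_speech']
print(type(first))
print(first == {'Hi', 'Anne'})
```

Because each cell is a real set, duplicates within a row are dropped and the order of the words is not preserved. If the order matters, a list would be the safer choice.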

Upvotes: 2
