Reputation: 458
I have a dataframe containing a column of boolean expressions and I want to make another column that is just a list of the elements of each expression.
Example:
Name Exp
A DDDD | LLLL & AAAA
D HHHH | DDDD | JJJJ
O UUUU & FFFF & RRRR
result df:
Name Exp Exp List
A DDDD | LLLL & AAAA ['DDDD','LLLL','AAAA']
D HHHH | DDDD | JJJJ ['HHHH','DDDD','JJJJ']
O UUUU & FFFF & RRRR ['UUUU','FFFF','RRRR']
Upvotes: 1
Views: 86
Reputation: 760
The findall approach in the answer by @jezrael will fail if the Exp column contains other special characters, such as the hyphen in L-LL below.
This implementation works if you know the boolean operators will always be either | or &:
>>> import pandas as pd
>>> df = pd.DataFrame({'Name': ['A', 'D', 'O'],
...                    'Exp': ['DDDD | L-LL & AAAA', 'HHHH | DDDD | JJJJ', 'UUUU & FFFF & RRRR']})
>>> df
Name Exp
0 A DDDD | L-LL & AAAA
1 D HHHH | DDDD | JJJJ
2 O UUUU & FFFF & RRRR
>>> # consume whitespace on both sides of the operator so the items carry no leading spaces
>>> df['Exp List'] = df['Exp'].str.split(r'\s*[|&]\s*')
>>> df
Name Exp Exp List
0 A DDDD | L-LL & AAAA [DDDD, L-LL, AAAA]
1 D HHHH | DDDD | JJJJ [HHHH, DDDD, JJJJ]
2 O UUUU & FFFF & RRRR [UUUU, FFFF, RRRR]
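To make the difference concrete, here is a minimal sketch comparing letter-only matching with splitting on the operators, using the same sample data. The pattern `\s*[|&]\s*` is my own shorthand for "either operator with optional whitespace around it", and the variable names are illustrative only. It assumes a pandas version where a multi-character pattern passed to `str.split` is treated as a regex.

```python
import pandas as pd

# Sample data modelled on the example above; 'L-LL' contains a hyphen.
df = pd.DataFrame({
    "Name": ["A", "D", "O"],
    "Exp": ["DDDD | L-LL & AAAA", "HHHH | DDDD | JJJJ", "UUUU & FFFF & RRRR"],
})

# Letter-only matching breaks 'L-LL' into two separate tokens.
by_findall = df["Exp"].str.findall(r"[a-zA-Z]+")

# Splitting on the operators keeps everything between them intact.
by_split = df["Exp"].str.split(r"\s*[|&]\s*")

print(by_findall[0])  # ['DDDD', 'L', 'LL', 'AAAA']
print(by_split[0])    # ['DDDD', 'L-LL', 'AAAA']
```

Which one is right depends on the data: if the operands are guaranteed to be plain letters, both give the same lists.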
Upvotes: 1
Reputation: 862851
Use Series.str.findall with the regex [a-zA-Z]+ to extract the words:
df['Exp List'] = df['Exp'].str.findall(r'[a-zA-Z]+')
# alternative: \w+ also matches digits and underscores
# df['Exp List'] = df['Exp'].str.findall(r'\w+')
print(df)
Name Exp Exp List
0 A DDDD | LLLL & AAAA [DDDD, LLLL, AAAA]
1 D HHHH | DDDD | JJJJ [HHHH, DDDD, JJJJ]
2 O UUUU & FFFF & RRRR [UUUU, FFFF, RRRR]
A solution using Series.str.split, with the separators escaped and optional whitespace around them:
df['Exp List'] = df['Exp'].str.split(r'\s*\|\s*|\s*&\s*')
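As a rough check of how this split pattern copes with uneven spacing, here is a small sketch on made-up input (the messy strings and the names `exp` and `tokens` are hypothetical, not from the question). It also strips the ends first, since the pattern only removes whitespace next to an operator.

```python
import pandas as pd

# Hypothetical messy input: uneven spacing and stray whitespace at the ends.
exp = pd.Series(["  DDDD|LLLL  &  AAAA ", "HHHH |DDDD| JJJJ"])

# Strip the ends first; otherwise the first and last items keep their padding.
tokens = exp.str.strip().str.split(r"\s*\|\s*|\s*&\s*")

print(tokens.tolist())
# [['DDDD', 'LLLL', 'AAAA'], ['HHHH', 'DDDD', 'JJJJ']]
```

If the column can contain missing values, the corresponding results stay missing rather than becoming lists, so a later step may need to handle that.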
Upvotes: 5