Reputation: 591
I am trying to remove a custom list of stop words, but it's not working.
desc = pd.DataFrame(description, columns =['description'])
print(desc)
Which gives the following results
description
188693 The Kentucky Cannabis Company and Bluegrass He...
181535 Ohio County Sheriff
11443 According to new reports from federal authorit...
213919 KANSAS CITY, Mo. (AP)The Chiefs will be withou...
171509 The crew of Insight, WCNY's weekly public affa...
... ...
2732 The Arkansas Supreme Court on Thursday cleared...
183367 Larry Pegram, co-owner of Pure Ohio Wellness, ...
134291 Joe Biden will spend the next five months pres...
239270 Find out where your Texas representatives stan...
246070 SAN TAN VALLEY — Two men have been charged wit...
[9875 rows x 1 columns]
I found the following code here, but it doesn't seem to work:
remove_words = ["marijuana", "cannabis", "hemp", "thc", "cbd"]
pat = '|'.join([r'\b{}\b'.format(w) for w in remove_words])
desc.assign(new_desc=desc.replace(dict(string={pat: ''}), regex=True))
Which produces the following results
description new_desc
188693 The Kentucky Cannabis Company and Bluegrass He... The Kentucky Cannabis Company and Bluegrass He...
181535 Ohio County Sheriff Ohio County Sheriff
11443 According to new reports from federal authorit... According to new reports from federal authorit...
213919 KANSAS CITY, Mo. (AP)The Chiefs will be withou... KANSAS CITY, Mo. (AP)The Chiefs will be withou...
171509 The crew of Insight, WCNY's weekly public affa... The crew of Insight, WCNY's weekly public affa...
... ... ...
2732 The Arkansas Supreme Court on Thursday cleared... The Arkansas Supreme Court on Thursday cleared...
183367 Larry Pegram, co-owner of Pure Ohio Wellness, ... Larry Pegram, co-owner of Pure Ohio Wellness, ...
134291 Joe Biden will spend the next five months pres... Joe Biden will spend the next five months pres...
239270 Find out where your Texas representatives stan... Find out where your Texas representatives stan...
246070 SAN TAN VALLEY — Two men have been charged wit... SAN TAN VALLEY — Two men have been charged wit...
9875 rows × 2 columns
As you can see, the stop words weren't removed. Any help you can provide would be greatly appreciated.
Upvotes: 1
Views: 668
Reputation: 38415
Lowercase the text so that capitalized words such as "Cannabis" also match, and simplify the pattern:
remove_words = ["marijuana", "cannabis", "hemp", "thc", "cbd"]
pat = '|'.join(remove_words)
desc['new_desc'] = desc.description.str.lower().replace(pat,'', regex=True)
description new_desc
0 The Kentucky Cannabis Company and Bluegrass He... the kentucky company and bluegrass he...
1 Ohio County Sheriff ohio county sheriff
2 According to new reports from federal authorit... according to new reports from federal authorit...
3 KANSAS CITY, Mo. (AP)The Chiefs will be mariju... kansas city, mo. (ap)the chiefs will be witho...
4 The crew of Insight, WCNY's weekly public affa... the crew of insight, wcny's weekly public affa...
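As far as I can tell, the attempt in the question changed nothing for two reasons: the nested dict is keyed on a column named string, which does not exist in desc (the column is description), and the pattern is case-sensitive, so "Cannabis" would not match "cannabis" anyway. Note also that the simplified pattern above drops the \b word boundaries, so it would remove those letters from inside longer words too. If you want to keep the original capitalization and match whole words only, here is a minimal sketch of one alternative. The three sample rows are made up for illustration; the whitespace cleanup at the end is my own addition.

```python
import re

import pandas as pd

# Made-up sample rows standing in for the real data.
desc = pd.DataFrame({
    "description": [
        "The Kentucky Cannabis Company and Bluegrass Hemp Oil",
        "Ohio County Sheriff",
        "CBD and THC products",
    ]
})

remove_words = ["marijuana", "cannabis", "hemp", "thc", "cbd"]

# Whole-word alternation; re.escape guards against regex metacharacters
# in case the stop-word list ever contains any.
pat = r"\b(?:" + "|".join(map(re.escape, remove_words)) + r")\b"

desc["new_desc"] = (
    desc["description"]
    # Case-insensitive removal, leaving the rest of the text untouched.
    .str.replace(pat, "", flags=re.IGNORECASE, regex=True)
    # Collapse the double spaces left behind by removed words.
    .str.replace(r"\s+", " ", regex=True)
    .str.strip()
)

print(desc)
```

The trade-off versus lowercasing first is that the rest of the text keeps its original case, at the cost of a slightly longer pattern.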
Upvotes: 4