Reputation: 2022
I have a DataFrame with lines like the following in a single column:
__label__JCB_Spare_Part __label__Differential_Housings jcb casting assy differential housing
__label__Vibrating_Roller __label__Road_Roller double drum mini roller seat drive model fyl engine nbsp hp aircolled diesel engine wheel size walk speed km climbing capacity drive hydrostatic drive nbsp nbsp
__label__Vibrating_Roller __label__Road_Roller double drum mini roller seat drive model fyl engine nbsp hp aircolled diesel engine wheel size walk speed km climbing capacity drive hydrostatic drive nbsp nbsp
__label__Crawler_Dozer __label__Bulldozer dozer bulldozer
__label__Crawler_Dozer __label__Bulldozer dozer bulldozer
I wish to extract all the words with the prefix __label__ into a separate column, as below:
__label__JCB_Spare_Part __label__Differential_Housings
__label__Vibrating_Roller __label__Road_Roller
__label__Vibrating_Roller __label__Road_Roller
__label__Crawler_Dozer __label__Bulldozer
__label__Crawler_Dozer __label__Bulldozer
What I have tried:
labels = input[0].str.extract(r'(__label__[\w]+)')
but it only pulls out the first label from each row.
Upvotes: 0
Views: 55
Reputation: 19885
Your code is mostly correct; you just want findall instead. str.extract returns only the first match in each row, whereas str.findall returns a list of every match:
labels = input[0].str.findall(r'(__label__[\w]+)')
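Note that findall gives a list per row, not the space-separated string shown in the question. A minimal sketch of getting from one to the other, assuming the data sits in column 0 of a DataFrame called `df` (a hypothetical name, with a two-row sample standing in for the real data):

```python
import pandas as pd

# Hypothetical sample frame; column 0 holds the raw lines, as in the question.
df = pd.DataFrame({0: [
    "__label__JCB_Spare_Part __label__Differential_Housings jcb casting assy differential housing",
    "__label__Crawler_Dozer __label__Bulldozer dozer bulldozer",
]})

# findall returns, for each row, a list of every substring matching the pattern.
label_lists = df[0].str.findall(r"__label__\w+")

# str.join glues each row's list back into a single space-separated string.
df["labels"] = label_lists.str.join(" ")

print(df["labels"].tolist())
```

If a list per row is good enough, the str.join step can simply be dropped.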
Upvotes: 1
Reputation: 204
You can try this:
import re
# named `text` rather than `str`, so the built-in str type is not shadowed
text = """
__label__JCB_Spare_Part __label__Differential_Housings jcb casting assy differential housing
__label__Vibrating_Roller __label__Road_Roller double drum mini roller seat drive model fyl engine nbsp hp aircolled diesel engine wheel size walk speed km climbing capacity drive hydrostatic drive nbsp nbsp
__label__Vibrating_Roller __label__Road_Roller double drum mini roller seat drive model fyl engine nbsp hp aircolled diesel engine wheel size walk speed km climbing capacity drive hydrostatic drive nbsp nbsp
__label__Crawler_Dozer __label__Bulldozer dozer bulldozer
__label__Crawler_Dozer __label__Bulldozer dozer bulldozer
"""
# raw string, so the \w escape is passed to the regex engine unchanged
result = re.findall(r'__label__\w+', text)
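One caveat: re.findall over the whole string returns a single flat list, so the information about which labels came from which line is lost. A rough sketch of keeping the labels grouped per line with only the standard library, assuming the input is a newline-separated string (`sample`, `LABEL` and `labels_per_line` are illustrative names, not from the question):

```python
import re

# Hypothetical two-line sample in the same shape as the data above.
sample = """
__label__JCB_Spare_Part __label__Differential_Housings jcb casting assy differential housing
__label__Crawler_Dozer __label__Bulldozer dozer bulldozer
"""

# Compile once; the raw string keeps \w intact for the regex engine.
LABEL = re.compile(r"__label__\w+")

# One entry per non-empty line, with that line's labels joined by spaces.
labels_per_line = [
    " ".join(LABEL.findall(line))
    for line in sample.splitlines()
    if line.strip()
]

print(labels_per_line)
```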
Upvotes: 0