Reputation: 245
I have the following dataframe:
A
url/3gth33/item/PO151302
url/3jfj6/item/S474-3
url/dfhk34j/item/4964114989191
url/sdfkj3k4/place/9b81f6fd
url/as3f343d/thing/ecc539ec
I'm looking to extract /item/ together with the value that follows it, for every row that contains /item/.
The end result should be:
item
/item/PO151302
/item/S474-3
/item/4964114989191
Here is what I've tried:
df['A'] = df['A'].str.extract(r'(/item/\w+\D+\d+$)')
This returns what I need, except for the values that consist only of digits.
Based on the regex docs I'm reading, this should grab all instances.
What am I missing here?
Upvotes: 0
Views: 93
Reputation: 13447
This is not a regex solution, but it could come in handy in some situations.
keyword = "/item/"
df["item"] = ((keyword + df["A"].str.split(keyword).str[-1]) *
df["A"].str.contains(keyword))
which returns
                                A                 item
0        url/3gth33/item/PO151302       /item/PO151302
1           url/3jfj6/item/S474-3         /item/S474-3
2  url/dfhk34j/item/4964114989191  /item/4964114989191
3     url/sdfkj3k4/place/9b81f6fd
4     url/as3f343d/thing/ecc539ec
5
And in case you want only the rows where item is not empty, you could use
df[df["item"].ne("")][["item"]]
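For reference, here is a self-contained sketch of the same split-and-filter idea. It swaps the multiplication by a boolean Series for Series.where, which is my substitution rather than the code above, and the names has_item, tail and items_only are made up for illustration. The sample frame is rebuilt from the question.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({"A": [
    "url/3gth33/item/PO151302",
    "url/3jfj6/item/S474-3",
    "url/dfhk34j/item/4964114989191",
    "url/sdfkj3k4/place/9b81f6fd",
    "url/as3f343d/thing/ecc539ec",
]})

keyword = "/item/"

# Boolean mask: True where the URL contains the literal keyword.
has_item = df["A"].str.contains(keyword, regex=False)

# Text after the last occurrence of the keyword (hypothetical helper name).
tail = df["A"].str.split(keyword).str[-1]

# Keep keyword + tail where the mask is True, otherwise an empty string.
df["item"] = (keyword + tail).where(has_item, "")

# Only the rows that actually contain an item.
items_only = df.loc[has_item, ["item"]]
print(items_only)
```

Using where keeps the intent explicit and does not rely on a string being multiplied by a boolean.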
Upvotes: 0
Reputation:
Use /item/.+ to match /item/ and anything after. Also, if you put ?P<foo> at the beginning of a group, e.g. (?P<foo>...), the column for that matched group in the returned dataframe of captures will be named what's inside the <...>:
item = df['A'].str.extract('(?P<item>/item/.+)').dropna()
Output:
>>> item
item
0 /item/PO151302
1 /item/S474-3
2 /item/4964114989191
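As for why the pattern from the question skips the digit-only value: \w+\D+\d+$ needs at least one non-digit character (\D+) followed by at least one digit at the end of the string, and 4964114989191 contains no non-digit at all. Below is a small sketch using the standard re module to compare the two patterns on the sample URLs. The helper first_group is made up for illustration, and the sketch assumes str.extract returns the first match in each string, as re.search does.

```python
import re

# Sample URLs copied from the question.
urls = [
    "url/3gth33/item/PO151302",
    "url/3jfj6/item/S474-3",
    "url/dfhk34j/item/4964114989191",
    "url/sdfkj3k4/place/9b81f6fd",
    "url/as3f343d/thing/ecc539ec",
]

# Pattern from the question: requires a non-digit followed by digits at the end.
original = re.compile(r"(/item/\w+\D+\d+$)")
# Pattern from this answer: /item/ and anything after it.
relaxed = re.compile(r"(?P<item>/item/.+)")


def first_group(pattern, text):
    """Return the first capture group of the first match, or None."""
    match = pattern.search(text)
    return match.group(1) if match else None


original_hits = [first_group(original, u) for u in urls]
relaxed_hits = [first_group(relaxed, u) for u in urls]

for url, old, new in zip(urls, original_hits, relaxed_hits):
    print(f"{url!r}: original={old!r}, relaxed={new!r}")
```

For PO151302 the original pattern still matches because backtracking lets \w+ take P, \D+ take O and \d+ take 151302. For S474-3 the hyphen satisfies \D+.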
Upvotes: 2