Reputation: 2055
I am having some trouble extracting a string from a URL using the re library.
Here's an example:
http://www.example.it/[email protected]&direction=vente.aspx%3pid%xx123%63abcd
I have a DataFrame and I want to add a column derived from a value in another column. In this example, df['URL_REG'] should contain '123'.
df['URL_REG'] = df['URL'].map(lambda x : re.findall(r'[REGEX]+', x)[0])
The structure of the URL can change, but the part I want always comes between 'direction=vente.aspx%3pid%' and '%'.
Upvotes: 0
Views: 624
Reputation: 210982
Use the vectorized Series.str.extract() method:
In [50]: df['URL_REG'] = df.URL.str.extract(r'direction=vente\.aspx%3pid%([^%]+)',
                                            expand=False)
In [51]: df
Out[51]:
                                                 URL URL_REG
0  http://www.example.it/remoteconnexion.aspx?u=x...   xx123
UPDATE:
I want only the '123' part instead of 'xx123', where 'xx' is a hexadecimal number
In [53]: df['URL_REG'] = df.URL.str.extract(r'direction=vente\.aspx%3pid%\w{2}(\d+)',
                                            expand=False)
In [54]: df
Out[54]:
                                                 URL URL_REG
0  http://www.example.it/remoteconnexion.aspx?u=x...     123
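For anyone who wants to try this outside an IPython session, here is a minimal self-contained sketch of the same idea. The sample URLs are made up for illustration (the second one deliberately has no 'direction=' part), and the dot in 'vente.aspx' is escaped so that it only matches a literal dot. Rows that do not match should come back as missing values rather than raising an error, which is one reason to prefer str.extract over findall(...)[0].

```python
import pandas as pd

# Hypothetical sample data: the first URL follows the layout from the question,
# the second one does not contain the 'direction=' part at all.
df = pd.DataFrame({
    "URL": [
        "http://www.example.it/page.aspx?direction=vente.aspx%3pid%xx123%63abcd",
        "http://www.example.it/other.aspx?foo=bar",
    ]
})

# Skip the two characters after '3pid%' and capture the digits that follow.
pattern = r"direction=vente\.aspx%3pid%\w{2}(\d+)"

# expand=False returns a Series (one capture group) instead of a DataFrame.
df["URL_REG"] = df["URL"].str.extract(pattern, expand=False)

print(df["URL_REG"].tolist())
```

If missing values are not wanted downstream, something like df["URL_REG"].fillna("") could be applied afterwards.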
Upvotes: 2
Reputation: 9267
You can use this pattern:
import re
url='http://www.example.it/[email protected]&direction=vente.aspx%3pid%xx123%63abcd'
output = re.findall('3pid%(.*?)%', url)
print(output)
Output:
['xx123']
Then apply the same pattern to your DataFrame.
For example:
import pandas as pd
import re
df = pd.DataFrame(['http://www.example.it/[email protected]&direction=vente.aspx%3pid%xx123%63abcd'], columns = ['URL'])
output = df['URL'].apply(lambda x : re.findall('3pid%(.*?)%', x))
print(output)
# Or, maybe if you want to return the url and the data captured:
# output = df['URL'].apply(lambda x : (x, re.findall('3pid%(.*?)%', x)))
# output[0]
# >>> ('http://www.example.it/[email protected]&direction=vente.aspx%3pid%xx123%63abcd',
# ['xx123'])
Output:
0 [xx123]
Name: URL, dtype: object
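Note that re.findall gives back a list for every row. If a plain string per row is preferred, one possible variant (a sketch, not the only way to do it) is to use re.search and take the first group, falling back to None when nothing matches. The helper name extract_pid and the sample URLs below are hypothetical.

```python
import re

import pandas as pd

# Non-greedy capture of whatever sits between '3pid%' and the next '%'.
PID_RE = re.compile(r"3pid%(.*?)%")


def extract_pid(url):
    """Return the captured text as a string, or None if the URL does not match."""
    match = PID_RE.search(url)
    return match.group(1) if match else None


# Hypothetical sample data: one matching URL and one without the marker.
df = pd.DataFrame({
    "URL": [
        "http://www.example.it/page.aspx?direction=vente.aspx%3pid%xx123%63abcd",
        "http://www.example.it/other.aspx?foo=bar",
    ]
})

df["URL_REG"] = df["URL"].apply(extract_pid)
print(df["URL_REG"].tolist())
```

Compiling the pattern once avoids re-parsing it for every row, although the re module caches recent patterns anyway.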
Upvotes: 0