Reputation: 409
I have a column in a pandas dataframe where some of the values are in this format: "From https://....com?gclid=... to https://...com". I would like to parse only the first URL so that the gclid and other IDs are removed, and then write the result back into the dataframe, e.g.: "From https://....com to https://...com"
I know that there is a Python module called urllib, but if I apply it to this string and call path() on it, it just parses the first URL and I lose the other part, which is as important as the first one.
Could somebody please help me? Thank you!
Upvotes: 0
Views: 70
Reputation: 142641
If you use a DataFrame, then use str.replace(), which can use a regex to find text like "?.... " (text which starts with ? and ends with a space, or, put another way, which starts with ? and contains only characters other than a space: '\?[^ ]+').
import pandas as pd
df = pd.DataFrame({'text': ["From https://....com?gclid=... to https://...com"]})
# regex=True is needed in pandas 2.0+, where str.replace defaults to a literal match
df['text'] = df['text'].str.replace(r'\?[^ ]+', '', regex=True)
Result
text
0 From https://....com to https://...com
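The pattern removes every query string in the value, not only the first one. If only the first URL should be cleaned, the number of replacements can be limited. Below is a minimal sketch with the standard re module; the sample string and its gclid/ref values are made up for illustration. In pandas, str.replace accepts an n argument that plays the same role as count here.

```python
import re

# Hypothetical sample value; the elided parts of the original string are filled with made-up data.
text = "From https://example.com?gclid=abc123 to https://example.org?ref=xyz"

# A literal '?' followed by one or more non-space characters.
pattern = re.compile(r'\?[^ ]+')

# Remove every query string in the value.
all_stripped = pattern.sub('', text)

# Remove only the first query string and leave later URLs untouched.
first_only = pattern.sub('', text, count=1)

print(all_stripped)
print(first_only)
```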
BTW: you can also try a more complex regex to make sure it is part of a URL which starts with http.
df['text'] = df['text'].str.replace(r'(http[^?]+)\?[^ ]+', r'\1', regex=True)
I use (...) to capture the URL before ?... and I put it back using \1 (now without the ?... part).
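Since the question mentions urllib, here is a sketch of how it could be combined with a regex so that the rest of the text is kept: the regex finds each URL, and urllib.parse rebuilds it without the query string and fragment. The names URL_PATTERN, drop_query and strip_queries are hypothetical helpers, and the sample URLs are made up. It assumes URLs in the text never contain spaces.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Assumption: a URL starts with http:// or https:// and runs until the next space.
URL_PATTERN = re.compile(r'https?://[^ ]+')

def drop_query(match):
    """Rebuild the matched URL, keeping only scheme, host and path."""
    parts = urlsplit(match.group(0))
    return urlunsplit((parts.scheme, parts.netloc, parts.path, '', ''))

def strip_queries(text):
    """Replace every URL in text with a copy that has no query string or fragment."""
    return URL_PATTERN.sub(drop_query, text)

cleaned = strip_queries("From https://example.com/page?gclid=abc123&x=1 to https://example.org")
print(cleaned)
```

With a DataFrame this could be applied as df['text'] = df['text'].map(strip_queries), provided every value in the column is a string.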
Upvotes: 1