Szabolcs Magyar

Reputation: 409

Python how to parse 2 URLs from a string and then map it back?

I have a column in a pandas DataFrame where some of the values look like this: "From https://....com?gclid=... to https://...com". I would like to parse the first URL so that the gclid and other IDs are removed, and then write the result back into the DataFrame, e.g.: "From https://....com to https://...com"

I know that there is a Python module called urllib, but if I apply it to this string and call path() on it, it parses only the first URL and I lose the rest of the string, which is just as important as the first part.

Could somebody please help me? Thank you!

Upvotes: 0

Views: 70

Answers (1)

furas

Reputation: 142641

If you are working with a DataFrame, use str.replace(), which accepts a regex. The pattern '\?[^ ]+' matches text like "?.... ": a ? followed by one or more characters other than a space, so the match ends at the next space.

import pandas as pd

df = pd.DataFrame({'text': ["From https://....com?gclid=... to https://...com"]})

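# regex=True is required in pandas 2.0+, where str.replace defaults to literal matching
df['text'] = df['text'].str.replace(r'\?[^ ]+', '', regex=True)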

Result

                                     text
0  From https://....com to https://...com
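The pattern can also be tried outside pandas with the standard re module. The sketch below is only an illustration: the sample strings and the helper name strip_queries are made up, and the first_only flag is an assumption about what "only the first URL" might mean in the question.

```python
import re

# Same pattern as above: a literal "?" followed by one or more non-space characters.
QUERY_RE = re.compile(r"\?[^ ]+")


def strip_queries(text, first_only=False):
    """Remove '?...' query parts from a string (hypothetical helper).

    With first_only=True, only the first match is removed.
    """
    count = 1 if first_only else 0  # count=0 means "replace every match"
    return QUERY_RE.sub("", text, count=count)


# Made-up sample strings for illustration only.
one_query = "From https://a.example.com?gclid=123 to https://b.example.com"
two_queries = "From https://a.example.com/page?x=1&y=2 to https://b.example.com?z=3"

print(strip_queries(one_query))
print(strip_queries(two_queries))
print(strip_queries(two_queries, first_only=True))
```

Note that the pattern removes the query from every URL in the string by default. In pandas, the equivalent of first_only should be the n argument of str.replace (for example n=1).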

BTW: you can also try a more complex regex to make sure the match is part of a URL that starts with http.

# [^? ]+ keeps the captured part inside a single URL (no spaces, no ?)
df['text'] = df['text'].str.replace(r'(http[^? ]+)\?[^ ]+', r'\1', regex=True)

I use (...) to capture the URL before ?... and put it back using \1 (already without the ?... part).
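Since the question mentions urllib, here is an alternative sketch that finds each URL with a regex and lets urllib.parse drop the query, while leaving the surrounding text untouched. This is not the approach above, only one possible variation: the names URL_RE, drop_query and clean_text are hypothetical, and the URL pattern (http or https followed by non-whitespace) is a simplifying assumption.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Assumption: a URL is "http://" or "https://" followed by non-whitespace characters.
URL_RE = re.compile(r"https?://\S+")


def drop_query(match):
    """Rebuild the matched URL without its query string."""
    parts = urlsplit(match.group(0))
    return urlunsplit(parts._replace(query=""))


def clean_text(text):
    """Strip the query from every URL found in the text; other text is kept as-is."""
    return URL_RE.sub(drop_query, text)


# Made-up sample string for illustration only.
sample = "From https://a.example.com/p?gclid=123&id=9 to https://b.example.com"
print(clean_text(sample))
```

To apply it to the column, something like df['text'] = df['text'].apply(clean_text) should work.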

Upvotes: 1
