Reputation: 4842
I am new to regex expressions.
I am trying to filter out values from 2 columns.
The first column look like this:
_source.cookie
__cfduid=d118f225fac35345d9e1d87e533b596ec1574680126; gclid=EAIaIQobChMIhNSMxZyF5gIVjMjeCh3V2A-pEAAYASABEgJQBPD_BwE; full_path=https://google.com/free-test/windows/; country_code=OM; clid=06a98eb3-177a-4692-8a15-04cb4c084c1c; ct_t=1574680122; ct_tid=1574680122; _ga=GA1.2.575812751.1574680122; _gid=GA1.2.560773616.1574680122; _gac_UA-138885843-1=1.1574680161.EAIaIQobChMIhNSMxZyF5gIVjMjeCh3V2A-pEAAYASABEgJQBPD_BwE; _gat=1; _gcl_aw=GCL.1574680123.EAIaIQobChMIhNSMxZyF5gIVjMjeCh3V2A-pEAAYASABEgJQBPD_BwE; _gcl_au=1.1.1227740955.1574680123; sessionid=yr0pycyfhjh90vauf0z8yw4kxno5rom0; u_id=22b5d5e0-d2b5-4a4a-ad6f-128008b4b466; _gat_UA-138885843-1=1
...
__cfduid=de7d3a7e772a62b171f445ce489bc5f791574680110; gclid=CjwKCAiAlO7uBRANEiwA_vXQ-4dP3_zZJmNXCm-P2acHITBe1XbZZZmQIGKcrL9EaoP4r9CaYEQbPxoC1uQQAvD_BwE; full_path=https://google.com/au/free-test/; country_code=AU; ct_tid=1574680121; _ga=GA1.2.476582918.1574680125; _gid=GA1.2.1129397609.1574680125; _gat=1; _gcl_au=1.1.356653701.1574680128; _gat_UA-138885843-1=1; clid=3d0b5be5-8b7b-4094-ba47-879252a59a7a; ct_t=1574680159; _gcl_aw=GCL.1574680162.CjwKCAiAlO7uBRANEiwA_vXQ-4dP3_zZJmNXCm-P2acHITBe1XbZZZmQIGKcrL9EaoP4r9CaYEQbPxoC1uQQAvD_BwE; _gac_UA-138885843-1=1.1574680169.CjwKCAiAlO7uBRANEiwA_vXQ-4dP3_zZJmNXCm-P2acHITBe1XbZZZmQIGKcrL9EaoP4r9CaYEQbPxoC1uQQAvD_BwE
__cfduid=d3b31d4cba74d440bf60e238a62bf46a51574680162; gclid=CjwKCAiAlO7uBRANEiwA_vXQ-yQeCe4-vuWQiZapqU7H5-YODheBwQf2Ra0c8CZwjf1ZGSqkw1KKXxoCeYMQAvD_BwE; full_path=https://google.com/au/best-test/; country_code=AU; clid=4e65772c-5da2-471a-86dd-240a34fd36ac; ct_t=1574680164; ct_tid=1574680164; _ga=GA1.2.242059245.1574680165; _gid=GA1.2.1757216414.1574680165; _gac_UA-138885843-1=1.1574680165.CjwKCAiAlO7uBRANEiwA_vXQ-yQeCe4-vuWQiZapqU7H5-YODheBwQf2Ra0c8CZwjf1ZGSqkw1KKXxoCeYMQAvD_BwE; _gat=1; _gcl_aw=GCL.1574680165.CjwKCAiAlO7uBRANEiwA_vXQ-yQeCe4-vuWQiZapqU7H5-YODheBwQf2Ra0c8CZwjf1ZGSqkw1KKXxoCeYMQAvD_BwE; _gcl_au=1.1.1892979809.1574680165
__cfduid=d054c8a93d4874e31aef9f2966829fefc1574680166; gclid=CjwKCAiAlO7uBRANEiwA_vXQ--5YOAD-mFNQFuM0dbd7lHsRBZSfOvhQynhZMhNHkEX-m7gosL23ABoCyS4QAvD_BwE; full_path=https://google.com/au/free-test/; country_code=AU; clid=726ebc25-95b9-4507-b29d-998ab54a9eeb; ct_t=1574680164; ct_tid=1574680164; _ga=GA1.2.1271977185.1574680165; _gid=GA1.2.506750010.1574680165; _gac_UA-138885843-1=1.1574680165.CjwKCAiAlO7uBRANEiwA_vXQ--5YOAD-mFNQFuM0dbd7lHsRBZSfOvhQynhZMhNHkEX-m7gosL23ABoCyS4QAvD_BwE; _gat=1; _gcl_aw=GCL.1574680165.CjwKCAiAlO7uBRANEiwA_vXQ--5YOAD-mFNQFuM0dbd7lHsRBZSfOvhQynhZMhNHkEX-m7gosL23ABoCyS4QAvD_BwE; _gcl_au=1.1.24394228.1574680165
__cfduid=d27ba2095c6b6ac5fb6108343075969f11574679826; full_path=https://google.com/reviews/testtest/; country_code=VN; ct_tid=1574679826; _ga=GA1.2.2008368313.1574679827; _gid=GA1.2.1231813533.1574679827; _gcl_au=1.1.299737663.1574679827; sessionid=dqwf1zmqdjkv9tdqi1cotr6m2judep2p; u_id=a71d0a87-b93d-4626-8f51-bcc0550dbbee; gclid=EAIaIQobChMI-ZOE3ZyF5gIVy2ArCh37VAdGEAEYASAAEgLCaPD_BwE; clid=aeb5b4d0-400b-47ee-b916-69a7b03544aa; ct_t=1574680166; _gac_UA-138885843-1=1.1574680167.EAIaIQobChMI-ZOE3ZyF5gIVy2ArCh37VAdGEAEYASAAEgLCaPD_BwE; _gat=1; _gcl_aw=GCL.1574680167.EAIaIQobChMI-ZOE3ZyF5gIVy2ArCh37VAdGEAEYASAAEgLCaPD_BwE
The second column look like this:
_source.request_url
https://google.com/go/test/?p3
https://google.com/au/test/?gclid=CjwKCAiAlO7uBRANEiwA_vXQ--5YOAD-mFNQFuM0dbd7lHsRBZSfOvhQynhZMhNHkEX-m7gosL23ABoCyS4QAvD_BwE
https://google.com/go/test/
...
https://google.com/api/dto/?click_type=gclid&click_id=CjwKCAiAlO7uBRANEiwA_vXQ-yQeCe4-vuWQiZapqU7H5-YODheBwQf2Ra0c8CZwjf1ZGSqkw1KKXxoCeYMQAvD_BwE&click_src=GET&cid=242059245.1574680165&user_id=&landing_page_uri=https%3A%2F%2Fgoogle.com%2Fau%2Fbest-vpn%2F%3Fgclid%3DCjwKCAiAlO7uBRANEiwA_vXQ-yQeCe4-vuWQiZapqU7H5-YODheBwQf2Ra0c8CZwjf1ZGSqkw1KKXxoCeYMQAvD_BwE&landing_page_referer=&lpu=https%3A%2F%2Fgoogle.com%2Fau%2Fbest-vpn%2F&lpr=https%3A%2F%2Fwww.google.com%2F&trigger=onLoad>mon=true&gaon=true&cookieon=true&ct_t=1574680164&ct_tid=1574680164&v=20191029&_=1574680164139
My goal is to extract glid values from both columns so that I would have 2 new columns Glid from cookie
and Gclid from URL
What I have so far:
def get_glid_from_source(pattern, data):
result = re.search(pattern, str(data))
if result is not None:
return result.group(1)
return None
df['Gclid_from_url'] = df.apply(lambda x: get_glid_from_source('[gclid|click_id]=(.+?)&', x['_source.request_url']), axis=1)
df['Gclid_from_cookie'] = df.apply(lambda x: get_glid_from_source('gclid=(.+?);', x['_source.cookie']), axis=1)
I would need to edit the expression so that:
1. Gclid can only start with a letter a-z or A-Z
2. Gclid ends with one of the following - ;, %, & or the end of string
Now after filtering I get values wfrom the second column that have click_id=ZFGe...
because the value I need can be either gclid=Value I need
or gclid&click_id= Value I need
EDIT
I have a third column which looks like this:
_source.request_url
www.google.com/api/test...
www.google.com/go/test...
www.google.com/fire-start.php/test...
www.google.com/test...
www.google.com/api/test...
I am making new column in the pandas dataframe where I add TRUE
or FALSE
values depending if these conditions are met in the above column:
If the link has:
.com/fire-start.php or .com/go/ or .com/api/
New column would have value as FALSE
if the patter is not found in the string 'TRUE' value is passed.
What I tried:
df['validate'] = df['_source.request_url'].str.extract(r'(www.google)=([a-zA-Z][^.com/fire-start.php|^.com/go/|^.com/api/]*)')
But that does not seem to work.
Thank you for your help, appreciate it.
Upvotes: 1
Views: 255
Reputation: 626794
You may use
df['Gclid_from_url'] = df['_source.request_url'].str.extract(r'(?:gclid|click_id)=([a-zA-Z][^&#]*)')
See the regex demo
The Gclid_from_cookie
can be populated using
df['Gclid_from_cookie'] = df['_source.cookie'].str.extract(r'gclid=([a-zA-Z][^&#;%]*)')
See this regex demo
Note that [gclid|click_id]
matches any 1 char defined in the character set, a g
, c
, l
, i
, d
, |
, k
or _
, not a sequence of chars, hence the non-capturing group (?:...)
in my pattern.
The value pattern is [a-zA-Z][^&#]*
or [a-zA-Z][^&#;%]*
that is quite self-explanatory: an ASCII letter and 0 or more chars other than &
, #
, ;
, %
.
As far as the updated part of the question is concerned, you need to understand that a negated character class matches a single char, not a sequence of chars, you can't "say" [^not]
to match any text but not
, [^not]
matches any char but n
, o
and t
.
Add
import re
filters = ['.com/fire-start.php', '.com/go/', '.com/api/']
df['validate']=df['_source.request_url'].str.contains("|".join(map(re.escape,filters)))
Upvotes: 2