Reputation: 87
I have a dataset with a variable whose observations are URLs. I am trying to create another variable that lists the type of domain for each observation in the "url" variable (.com, .org, .co.uk, etc.).
I could split the "url" variable by parsing on ".":
split url, p(.)
but that would not definitively give me the domain name.
The problem arises from the high variance in the form of the URLs. For example:
- www.google.com would be split into 3 variables
- http://www.nih.nlm.gov would be split into 4
- www.yahoo.com would be split into 3
- https://www.movies.yahoo.co.au would be split into 5
How can I write the following rules in Stata to create the "domain type" variable from the "url" variable?
- If the part after the last "." in the "url" variable has ≥ 3 characters (.com, .edu, .org, .gov or .info), use it as the domain type.
- If the part after the last "." has < 3 characters (.uk, .au, .tv, etc.) AND the part before the last "." has ≤ 2 characters (.co), use everything after the penultimate "." as the domain type (i.e. .co.uk).
- If the part after the last "." has < 3 characters (.us domains) AND the part before the last "." has > 2 characters, use the part after the last "." as the domain type (e.g. freeshootinggames.us).
Also, is there another way of doing this?
I am working in Stata 13.1 on Windows 8 Pro x64.
Thanks!
Upvotes: 1
Views: 2095
Reputation: 9470
Reversing strings is a useful trick in problems like this. Try something like this:
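* Reverse each URL so that its last component comes first after splitting
gen rev_url = reverse(url)
split rev_url, parse(.) gen(domain_)
* Un-reverse the last two components
replace domain_1 = reverse(domain_1)
replace domain_2 = reverse(domain_2)
* Prepend a short second-level label (e.g. "co") to a short country code
replace domain_1 = domain_2 + "." + domain_1 if length(domain_2)<=2 & length(domain_1)<3
rename domain_1 domain
drop domain_* rev_url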
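As for another way of doing this, here is a minimal sketch using regular expressions instead of reversing and splitting. It is not tested against your data and makes assumptions: the test URLs are the examples from the question, each URL ends with the domain (no trailing slash, path or port), and the domain labels contain only letters. The quantifiers are spelled out as [a-zA-Z][a-zA-Z]? because I am not sure that regexm() in Stata 13 accepts {1,2}.

```stata
* Example data taken from the question (replace with your own dataset)
clear
input str40 url
"www.google.com"
"http://www.nih.nlm.gov"
"www.yahoo.com"
"https://www.movies.yahoo.co.au"
"freeshootinggames.us"
end

* Rules 1 and 3: take whatever follows the last "."
gen domain = regexs(1) if regexm(url, "\.([a-zA-Z]+)$")

* Rule 2: a label of at most two letters followed by a suffix of at most
* two letters (e.g. co.au); overwrite the shorter match in that case
replace domain = regexs(1) if regexm(url, "\.([a-zA-Z][a-zA-Z]?\.[a-zA-Z][a-zA-Z]?)$")

list url domain, noobs
```

On these five URLs the domain variable should come out as com, gov, com, co.au and us. URLs that do not end in a letters-only suffix are left with an empty domain, which makes them easy to list and inspect by hand.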
Upvotes: 2