Reputation: 1269
For example,
url: https://pig.apache.org/docs/r0.14.0/func.html
url: http://hadoop.apache.org/docs/r2.7.1/hadoop-project-dist/hadoop-hdfs/HDFSCommands.html
The URLs are not limited to the above examples. I want to extract the host name, for example:
host_name : pig.apache.org
Could anyone help me out?
Upvotes: 1
Views: 749
Reputation: 5541
You're actually looking to extract host names, not domain names: pig.apache.org is a host name, while apache.org is the domain name.
Luckily, the nice people at Pig have written a UDF to do this. It ships in piggybank, so the piggybank jar has to be registered in your script first. Then simply use the HostExtractor UDF like this:
host = FOREACH row GENERATE org.apache.pig.piggybank.evaluation.util.apachelogparser.HostExtractor(referer);
The API docs can be found at: https://pig.apache.org/docs/r0.8.1/api/org/apache/pig/piggybank/evaluation/util/apachelogparser/HostExtractor.html
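To make the host name vs. domain name distinction concrete, here is a minimal Java sketch that pulls the host out of the two URLs from the question using the standard `java.net.URI` parser. This is only an illustration of what "host" means here; the class and method names (`HostFromUri`, `hostOf`) are made up for this example, and it is not claimed to be how the piggybank UDF is implemented.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper class, for illustration only.
class HostFromUri {
    // Returns the host component of a URL, or null if the string
    // cannot be parsed as a URI or has no host component.
    static String hostOf(String url) {
        try {
            return new URI(url).getHost();
        } catch (URISyntaxException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        // The two example URLs from the question.
        System.out.println(hostOf("https://pig.apache.org/docs/r0.14.0/func.html"));
        System.out.println(hostOf("http://hadoop.apache.org/docs/r2.7.1/hadoop-project-dist/hadoop-hdfs/HDFSCommands.html"));
    }
}
```

For the first URL this yields pig.apache.org, i.e. the full host name rather than the apache.org domain.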
Upvotes: 4
Reputation: 596
It sounds like you want to run a regex on each URL to extract the host name. That should be something like:
splt = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT(line, '.*://(?:www\\.)?([^/]+)', 1));
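Since Pig regular expressions follow Java regex syntax, the pattern can be sanity-checked in plain Java before running a job. The sketch below uses a lightly cleaned-up variant of the pattern (colon and slash left unescaped, the dot after www escaped so it matches only a literal dot). The class and method names are hypothetical, the use of `find()` is an assumption about how the match is applied, and the www.example.com URL is an invented test input.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
class HostRegexCheck {
    // Skip everything up to "://", drop an optional "www." prefix,
    // then capture everything up to the next "/".
    private static final Pattern HOST =
            Pattern.compile(".*://(?:www\\.)?([^/]+)");

    // Returns capture group 1, or null when the pattern does not match.
    static String extractHost(String url) {
        Matcher m = HOST.matcher(url);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(extractHost("https://pig.apache.org/docs/r0.14.0/func.html"));
        System.out.println(extractHost("http://www.example.com/index.html"));
    }
}
```

Note that this pattern strips a leading www., so it returns example.com for the second input, whereas a URL parser would return www.example.com. It also keeps a port number if one follows the host, because the capture only stops at the next slash.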
Upvotes: 0