manojpt

Reputation: 1269

Extract host name from URL using Apache Pig

For example,

 url: https://pig.apache.org/docs/r0.14.0/func.html
 url: http://hadoop.apache.org/docs/r2.7.1/hadoop-project-dist/hadoop-hdfs/HDFSCommands.html

The URLs are not limited to the examples above. For the first URL, I want to extract the host name as:

 host_name : pig.apache.org

Could anyone help me out?

Upvotes: 1

Views: 749

Answers (2)

Ben Watson

Reputation: 5541

You're actually looking to extract host names, not domain names: pig.apache.org is a host name, while apache.org is the domain name.

Luckily, the nice people at Pig have written a UDF to do this; it ships in Piggybank, so the Piggybank jar has to be registered in your script first. Then use the HostExtractor UDF like this:

host = FOREACH row GENERATE org.apache.pig.piggybank.evaluation.util.apachelogparser.HostExtractor(referer);

The API docs can be found at: https://pig.apache.org/docs/r0.8.1/api/org/apache/pig/piggybank/evaluation/util/apachelogparser/HostExtractor.html
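
For completeness, here is a minimal sketch of a full script built around that line. The jar path, the input file name urls.txt, and the relation and field names are placeholders I made up for illustration, so adjust them to your setup:

```pig
-- Assumed jar location; point this at the piggybank.jar of your installation.
REGISTER /path/to/piggybank.jar;

-- Short alias so the fully qualified class name is written only once.
DEFINE HostExtractor org.apache.pig.piggybank.evaluation.util.apachelogparser.HostExtractor();

-- Hypothetical input: a text file with one URL per line.
row = LOAD 'urls.txt' AS (url:chararray);

-- Keep the original URL next to the extracted host for easy checking.
host = FOREACH row GENERATE url, HostExtractor(url) AS host_name;

DUMP host;
```

With the first URL from the question as input, host_name should come out as pig.apache.org.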

Upvotes: 4

pwilmot

Reputation: 596

It sounds like you want to run a regex on each URL to extract the host name. Something like the following should work (note that this pattern also drops a leading "www" prefix):

splt = FOREACH A GENERATE REGEX_EXTRACT(line, '.*://(?:www\\.)?([^/]+)', 1);

Upvotes: 0
