Reputation: 33
I have a few log lines like the ones below:
endeavor.fujitsu.co.jp - - [10/Jul/1995:00:00:15 -0400] "GET /images/ HTTP/1.0" 200 17688
ad13-022.compuserve.com - - [10/Jul/1995:00:00:15 -0400] "GET /history/gemini/gemini-spacecraft.txt HTTP/1.0" 200 651
pm2-15.magicnet.net - - [10/Jul/1995:00:00:15 -0400] "GET /images/launch-logo.gif HTTP/1.0" 200 1713
204.239.199.40 - - [10/Jul/1995:00:00:16 -0400] "GET /shuttle/missions/sts-71/images/KSC-95EC-0613.gif HTTP/1.0" 200 45970
pm1-4.tricon.net - - [10/Jul/1995:00:00:17 -0400] "GET /images/WORLD-logosmall.gif HTTP/1.0" 200 669
scorpio.digex.net - - [10/Jul/1995:00:00:19 -0400] "GET /history/mercury/mr-3/mr-3.html HTTP/1.0" 200 1124
I need to extract the request paths from the above logs. Here is the code that I tried:
val pattern = "\\s+([^\\s]+)\\s+HTTP".r
val matched = pattern.findFirstIn(log)
Here is the output that I got.
/images/ HTTP
/history/gemini/gemini-spacecraft.txt HTTP
/images/launch-logo.gif HTTP
/shuttle/missions/sts-71/images/KSC-95EC-0613.gif HTTP
/images/WORLD-logosmall.gif HTTP
/history/mercury/mr-3/mr-3.html HTTP
How do I get rid of HTTP in the above paths?
Upvotes: 2
Views: 160
Reputation: 163362
The path you want is in the first capturing group, (...), so read that group instead of the whole match.
The pattern can also be shortened to:
\s(\S+)\s+HTTP
In Scala
val pattern = "\\s(\\S+)\\s+HTTP".r
You can then read the paths from group 1, for example with findAllIn:
val pattern = "\\s(\\S+)\\s+HTTP".r
val strings = List(
"""endeavor.fujitsu.co.jp - - [10/Jul/1995:00:00:15 -0400] "GET /images/ HTTP/1.0" 200 17688 """,
"""ad13-022.compuserve.com - - [10/Jul/1995:00:00:15 -0400] "GET /history/gemini/gemini-spacecraft.txt HTTP/1.0" 200 651 """,
"""pm2-15.magicnet.net - - [10/Jul/1995:00:00:15 -0400] "GET /images/launch-logo.gif HTTP/1.0" 200 1713 """,
"""204.239.199.40 - - [10/Jul/1995:00:00:16 -0400] "GET /shuttle/missions/sts-71/images/KSC-95EC-0613.gif HTTP/1.0" 200 45970 """,
"""pm1-4.tricon.net - - [10/Jul/1995:00:00:17 -0400] "GET /images/WORLD-logosmall.gif HTTP/1.0" 200 669 """,
"""scorpio.digex.net - - [10/Jul/1995:00:00:19 -0400] "GET /history/mercury/mr-3/mr-3.html HTTP/1.0" 200 1124"""
)
strings.foreach { log =>
  // group(1) is the captured path, without the trailing " HTTP"
  val m = pattern.findAllIn(log).group(1)
  println(m)
}
Result
/images/
/history/gemini/gemini-spacecraft.txt
/images/launch-logo.gif
/shuttle/missions/sts-71/images/KSC-95EC-0613.gif
/images/WORLD-logosmall.gif
/history/mercury/mr-3/mr-3.html
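As a more defensive sketch of the same idea: calling group(1) on the iterator can throw an IllegalStateException when a line does not match the pattern at all. One option is findFirstMatchIn, which returns an Option. The names ExtractPaths and extractPath below are only illustrative, and "not a log line" is a made-up input used to show the no-match case.

```scala
import scala.util.matching.Regex

object ExtractPaths {
  // Same pattern as above: capture the non-whitespace run before " HTTP".
  val pattern: Regex = "\\s(\\S+)\\s+HTTP".r

  // Hypothetical helper: Some(path) when the line matches, None otherwise.
  def extractPath(log: String): Option[String] =
    pattern.findFirstMatchIn(log).map(_.group(1))

  def main(args: Array[String]): Unit = {
    val logs = List(
      """endeavor.fujitsu.co.jp - - [10/Jul/1995:00:00:15 -0400] "GET /images/ HTTP/1.0" 200 17688""",
      "not a log line" // made-up input with no match
    )
    // Lines that do not match produce None instead of an exception.
    logs.foreach(log => println(extractPath(log)))
  }
}
```

With a list of logs, logs.flatMap(ExtractPaths.extractPath) would keep only the paths from lines that matched.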
To also match lines that have no HTTP version after the path, such as this one from the comments:
columbia.acc.brad.ac.uk - - [10/Jul/1995:00:52:36 -0400] "GET /ksc.html" 200 7067
You might use:
\S+ (/(?:[^/\s]+/)*[^\s"]+)
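A minimal sketch of how that pattern could be used from Scala, assuming the path is again read from group 1. The names OptionalHttpPaths and extractPath are illustrative only; the log lines are taken from the question and the comment.

```scala
object OptionalHttpPaths {
  // Pattern from the answer: a token and a space, then a path that starts
  // with "/" and stops at whitespace or a double quote.
  val pattern = """\S+ (/(?:[^/\s]+/)*[^\s"]+)""".r

  // Hypothetical helper: Some(path) when the line matches, None otherwise.
  def extractPath(log: String): Option[String] =
    pattern.findFirstMatchIn(log).map(_.group(1))

  def main(args: Array[String]): Unit = {
    val logs = List(
      """columbia.acc.brad.ac.uk - - [10/Jul/1995:00:52:36 -0400] "GET /ksc.html" 200 7067""",
      """ad13-022.compuserve.com - - [10/Jul/1995:00:00:15 -0400] "GET /history/gemini/gemini-spacecraft.txt HTTP/1.0" 200 651"""
    )
    // Prints the path for lines with and without the HTTP version.
    logs.flatMap(extractPath).foreach(println)
  }
}
```

Because this pattern no longer relies on HTTP following the path, it depends on the path being the first token in the line that starts with a slash after a space.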
Upvotes: 0
Reputation: 37755
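Your match is in the first capturing group.
Alternatively, you can use a positive lookahead, which checks that HTTP follows without including it in the match:
\\s+[^\\s]+(?=\\s+HTTP)
Note that the leading \\s+ is still part of the match, so trim the result if you need the bare path.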
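A small sketch of the lookahead approach in Scala. This is a variant of the pattern above rather than the exact one: it drops the leading \s+ so that the match itself carries no leading whitespace and can be used directly with findFirstIn. The names LookaheadPaths and extractPath are illustrative.

```scala
object LookaheadPaths {
  // \S+ matches the path; (?=\s+HTTP) asserts that whitespace and "HTTP"
  // follow, without making them part of the match.
  val pattern = """\S+(?=\s+HTTP)""".r

  // Hypothetical helper: the whole match is the path, so no group is needed.
  def extractPath(log: String): Option[String] =
    pattern.findFirstIn(log)

  def main(args: Array[String]): Unit = {
    val log =
      """pm2-15.magicnet.net - - [10/Jul/1995:00:00:15 -0400] "GET /images/launch-logo.gif HTTP/1.0" 200 1713"""
    println(extractPath(log))
  }
}
```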
Upvotes: 2