Reputation: 9302
I have a file, which contains automatically generated statistical data from apache http logs.
I'm really struggling on how to match lines between 2 sections of text. This is a portion of the stat file I have:
jpg 6476 224523785 0 0
Unknown 31200 248731421 0 0
gif 197 408771 0 0
END_FILETYPES
# OS ID - Hits
BEGIN_OS 12
linuxandroid 1034
winlong 752
winxp 1320
win2008 204250
END_OS
# Browser ID - Hits
BEGIN_BROWSER 79
mnuxandroid 1034
winlong 752
winxp 1320
What I'm trying to do is write a regex which will only search between the tags BEGIN_OS 12 and END_OS.
I want to create a PHP array that contains the OS and the hits, for example (I know the actual array won't actually be exactly like this, but as long as I have this data in it):
array(
[0] => array(
[0] => linuxandroid
[1] => winlong
[2] => winxp
[3] => win2008
)
[1] => array(
[0] => 1034
[1] => 752
[2] => 1320
[3] => 204250
)
)
I've been trying for a good couple of hours now with gskinner regex tester to test regular expressions, but regex is far from my strong point.
I would post what I've got so far, but I've tried loads, and the closest one I've got is:
^[BEGIN_OS\s12]+([a-zA-Z0-9]+)\s([0-9]+)
which is pathetically awful!
Any help would be appreciated, even if it's an 'It can't be done'.
Upvotes: 0
Views: 65
Reputation: 7181
You might try something like:
/BEGIN_OS 12\s(?:([\w\d]+)\s([\d]+\s))*END_OS/gm
You'll still have to parse the match to get your results. You could also simplify it with something like:
/BEGIN_OS 12([\s\S]*)END_OS/gm
And then just parse the first group (the text between the tags): split it on '\n', then split each line on ' ' to get the parts you want.
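A minimal PHP sketch of this "capture, then split" approach. The sample `$text` is copied from the question's excerpt; the variable names, the use of `\d+` instead of the literal `12`, and the lazy `*?` quantifier are my own assumptions, not part of the patterns above. Note that PHP's `preg_*` functions do not accept the `g` flag, so it is left out here.

```php
<?php
// Sample input, copied from the stat file excerpt in the question.
$text = "END_FILETYPES\n# OS ID - Hits\nBEGIN_OS 12\nlinuxandroid 1034\n"
      . "winlong 752\nwinxp 1320\nwin2008 204250\nEND_OS\n"
      . "# Browser ID - Hits\nBEGIN_BROWSER 79\n";

$names = array();
$hits  = array();

// Capture everything between "BEGIN_OS <n>" and the first "END_OS".
if (preg_match('/BEGIN_OS \d+([\s\S]*?)END_OS/', $text, $m)) {
    // $m[1] is the text between the tags; split it into lines first ...
    foreach (preg_split('/\R/', trim($m[1])) as $line) {
        // ... then split each line on whitespace into name and hit count.
        $parts = preg_split('/\s+/', trim($line));
        if (count($parts) === 2) {
            $names[] = $parts[0];
            $hits[]  = (int) $parts[1];
        }
    }
}

// Same two-list layout the question asks for: names first, hits second.
$osStats = array($names, $hits);
print_r($osStats);
```

Splitting on `\R` rather than a literal `"\n"` is a deliberate choice so that files with Windows line endings are handled the same way.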
Edit
Regexes with comments:
/BEGIN_OS 12 // Match "BEGIN_OS 12" exactly
\s // Match a whitespace character after
(?: // Begin a non-capturing group
([\w\d]+) // Capture one or more word characters (\w already includes digits)
\s // Match a whitespace character
([\d]+\s) // Capture one or more digits plus the whitespace character that follows
)* // End non-capturing group, repeat the group 0 or more times
END_OS // Match "END_OS" exactly
/gm // global search (g) and multiline (m); note that PHP's preg_* functions do not accept the g flag
And the simple version:
/BEGIN_OS 12 // Match "BEGIN_OS 12" exactly
( // Begin group
[\s\S]* // Match any whitespace/non-whitespace character (works like '.', but also matches newlines)
) // End group
END_OS // Match "END_OS" exactly
/gm // global search (g) and multiline (m)
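One thing worth seeing in practice with the first (repeated-group) version: when a capturing group sits inside a repeated group, PCRE keeps only the capture from the last repetition. The sketch below runs that pattern in PHP without the `g`/`m` flags; the sample `$text` and the variable names are assumptions made for illustration.

```php
<?php
// Only the OS section from the question, as a sample input.
$text = "BEGIN_OS 12\nlinuxandroid 1034\nwinlong 752\nwinxp 1320\nwin2008 204250\nEND_OS\n";

// The repeated-group pattern from this answer, minus the g and m flags.
$pattern = '/BEGIN_OS 12\s(?:([\w\d]+)\s([\d]+\s))*END_OS/';

$matched = preg_match($pattern, $text, $m);

// $m[0] holds the whole section, but groups 1 and 2 only hold the values
// from the final repetition of the non-capturing group.
$lastName = $m[1];        // the last OS entry only
$lastHits = trim($m[2]);  // group 2 also captured the trailing newline

var_dump($matched, $lastName, $lastHits);
```

This is why the match still has to be parsed afterwards: the pattern confirms the section is well formed, but it does not hand back every name and hit count on its own.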
Secondary Edit
Your attempt:
^[BEGIN_OS\s12]+([a-zA-Z0-9]+)\s([0-9]+)
Won't give you the results you expect. If you break it apart:
^ // Match the start of a line, without 'm' this means the beginning of the string.
[BEGIN_OS\s12]+ // This means: match one or more characters, each being any of [B, E, G, I, N, _, O, S, \s, 1, 2].
// While this matches "BEGIN_OS 12", it also matches any other text
// made up of a combination of those characters, or just a run of
// whitespace (thanks to \s).
([a-zA-Z0-9]+) // This should match the part you expect, but potentially not with the previous rules in place.
\s
([0-9]+) // This is the same as [\d]+ or \d+ but should match what you expect (again, potentially not with the first rule)
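The breakdown above can be checked with a short PHP sketch. The input strings, including the made-up "NOISE 12 foo 7", are hypothetical test data chosen for illustration and do not come from the stat file.

```php
<?php
// The pattern from the question, unchanged.
$pattern = '/^[BEGIN_OS\s12]+([a-zA-Z0-9]+)\s([0-9]+)/';

// Intended case: the character class consumes "BEGIN_OS 12" and the newline,
// then the two groups capture the first entry.
preg_match($pattern, "BEGIN_OS 12\nlinuxandroid 1034", $a);

// Unintended case: "NOISE 12 " consists only of characters from the class,
// so a line that has nothing to do with BEGIN_OS matches as well.
preg_match($pattern, "NOISE 12 foo 7", $b);

// On the full section, only the first entry is ever found, because the
// pattern is anchored with ^ and the later lines do not start with a
// character from the class.
$section = "BEGIN_OS 12\nlinuxandroid 1034\nwinlong 752\nwinxp 1320\nEND_OS";
$count = preg_match_all($pattern, $section, $all);

var_dump($a[1], $a[2], $b[1], $b[2], $count);
```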
Upvotes: 1
Reputation: 76666
A regular expression may not be the best tool for this job. You can use a regex to get the required substring and then do the further processing with PHP's string manipulation functions.
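// Keep only the text between "BEGIN_OS <n>" and "END_OS" (the s flag lets . match newlines).
$string = preg_replace('/^.*BEGIN_OS \d+\s*(.*?)\s*END_OS.*/s', '$1', $text);
$result = array();
// Split on any line ending (\R) instead of PHP_EOL, so the file's newline style doesn't matter.
foreach (preg_split('/\R/', $string) as $line) {
    list($key, $value) = explode(' ', $line);
    $result[$key] = $value;
}
print_r($result);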
Should give you the following output:
Array
(
[linuxandroid] => 1034
[winlong] => 752
[winxp] => 1320
[win2008] => 204250
)
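If the two-list layout from the question is preferred over an associative array, the result can be reshaped with `array_keys()` and `array_values()`. A small sketch, with `$result` hard-coded to the output shown above and `$asRequested` being a name chosen here for illustration:

```php
<?php
// The associative array produced above, hard-coded for this sketch.
$result = array(
    'linuxandroid' => '1034',
    'winlong'      => '752',
    'winxp'        => '1320',
    'win2008'      => '204250',
);

// Index 0 holds the OS names, index 1 holds the hit counts (still strings).
$asRequested = array(array_keys($result), array_values($result));
print_r($asRequested);
```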
Upvotes: 3