Reputation: 2752
I do not really know what I need to get this fixed, but I am trying to extract the OS, OS version and brands like iPhone, Macintosh from the following browser useragents:
Mozilla/5.0 (Windows NT 5.1) AppleWebKit/534.34 (KHTML, like Gecko) Dooble/1.40 Safari/534.34
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.75.14 (KHTML, like Gecko) Version/7.0.3 Safari/7046A194A
Mozilla/5.0 (iPhone; U; CPU like Mac OS X) AppleWebKit/420.1 (KHTML, like Gecko) Version/3.0 Mobile/4A93 Safari/419.
Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; de-at) AppleWebKit/533.21.1 (KHTML, like Gecko) Version/5.0.5 Safari/533.21.1
Mozilla/5.0 (Windows; U; Windows NT 6.1; tr-TR) AppleWebKit/533.20.25 (KHTML, like Gecko) Version/5.0.4 Safari/533.20.27
Mozilla/5.0 (Linux; U; Android 2.2.1; zh-tw; HTC_Sensation_S710e Build/FRG83D) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1
Mozilla/5.0 (Windows; U; Windows NT 6.0; nl) AppleWebKit/522.13.1 (KHTML, like Gecko) Version/3.0.2 Safari/522.13.1
Mozilla/5.0 (BlackBerry; U; BlackBerry 9700; en-US) AppleWebKit/534.8+ (KHTML, like Gecko) Version/6.0.0.380 Mobile Safari/534.8+
I do not know if I need match_all, match, replace, split. The strings are not all the same, and I am trying the following regex:
preg_match_all('/\((.*?);|\((.*?)\) AppleWebKit/im', $user_agent, $brandmatch, PREG_PATTERN_ORDER);
Which has this result, which is good:
Macintosh
iPhone
Macintosh
Windows
Linux
Windows
BlackBerry
Windows NT 5.1
preg_match_all('/\(.*?; (.*?)\)/im', $user_agent, $brandmatch, PREG_PATTERN_ORDER);
Which has this result: (I want 1 - 6 to be like 0)
0 => Intel Mac OS X 10_9_3
1 => U; CPU like Mac OS X
2 => U; Intel Mac OS X 10_6_8; de-at
3 => U; Windows NT 6.1; tr-TR
4 => U; Android 2.2.1; zh-tw; HTC_Sensation_S710e Build/FRG83D
5 => U; Windows NT 6.0; nl
6 => U; BlackBerry 9700; en-US
So I tried the following:
preg_match_all('/U; (.*?);/im', $user_agent, $brandmatch, PREG_PATTERN_ORDER);
Which has this result: (It has 2 fewer matches than above, which is bad)
0 => Intel Mac OS X 10_6_8
1 => Windows NT 6.1
2 => Android 2.2.1
3 => Windows NT 6.0
4 => BlackBerry 9700
So what I am trying to do is: I want the OS + OS versions. I also tried:
\(.*?; (.*?)\)|U; (.*?);
Which has this result:
0 => Intel Mac OS X 10_9_3
1 => U; CPU like Mac OS X
2 => U; Intel Mac OS X 10_6_8; de-at
3 => U; Windows NT 6.1; tr-TR
4 => U; Android 2.2.1; zh-tw; HTC_Sensation_S710e Build/FRG83D
5 => U; Windows NT 6.0; nl
6 => U; BlackBerry 9700; en-US
So the results I need are:
0 => Intel Mac OS X 10_9_3
1 => CPU like Mac OS X
2 => Intel Mac OS X 10_6_8
3 => Windows NT 6.1
4 => Android 2.2.1
5 => Windows NT 6.0
6 => BlackBerry 9700
Upvotes: 0
Views: 46
Reputation: 785008
You can use this regex:
/^\S+ +\((?:[^;\n]*;)?(?: U; )?([^;)]+)/m
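A minimal sketch of how this regex might be applied with preg_match_all, assuming the user agents sit one per line in $user_agent as in the question (only four of them are used here). The names $m and $os are illustrative. The trim() call is an addition: when no " U; " is present, group #1 starts with the space that follows the first ";".

```php
<?php
// Four of the user agents from the question, one per line.
$user_agent = <<<'UA'
Mozilla/5.0 (Windows NT 5.1) AppleWebKit/534.34 (KHTML, like Gecko) Dooble/1.40 Safari/534.34
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.75.14 (KHTML, like Gecko) Version/7.0.3 Safari/7046A194A
Mozilla/5.0 (iPhone; U; CPU like Mac OS X) AppleWebKit/420.1 (KHTML, like Gecko) Version/3.0 Mobile/4A93 Safari/419.
Mozilla/5.0 (Linux; U; Android 2.2.1; zh-tw; HTC_Sensation_S710e Build/FRG83D) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1
UA;

// The m flag lets ^ match at the start of every line, so each line yields one match.
preg_match_all('/^\S+ +\((?:[^;\n]*;)?(?: U; )?([^;)]+)/m', $user_agent, $m);

// $m[1] holds group #1 for every match; trim() drops a possible leading space.
$os = array_map('trim', $m[1]);

print_r($os);
```

For these four lines, $os should contain Windows NT 5.1, Intel Mac OS X 10_9_3, CPU like Mac OS X and Android 2.2.1.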
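The regex matches the opening (, then optionally 0 or more characters until a newline or ; followed by a ;, then an optional " U; ". Everything from there until a ) or ; is found is captured in matched group #1.
Upvotes: 2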
Reputation: 89557
The branch reset feature may interest you, because it allows several alternatives while each alternative shares the same capture group numbers with the others.
A branch reset is like this:
(?|alternat(ive1)|alternati(ve2)|alternat(ive3)|e(tc.))
You can see four pairs of capturing parentheses, but in this construct they all share the same number (so only one capture group is defined, and its content depends on the branch that succeeds).
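To make the difference concrete, here is a small sketch comparing a plain alternation with a branch reset. The subject string and the id=/# patterns are made up for illustration and have nothing to do with user agents.

```php
<?php
$subject = 'see #42';

// Plain alternation: each pair of parentheses gets its own number,
// so the value lands in group 2 when the second branch matches.
preg_match('~id=(\d+)|#(\d+)~', $subject, $without);

// Branch reset: both pairs of parentheses are group 1.
preg_match('~(?|id=(\d+)|#(\d+))~', $subject, $withReset);

var_dump($without[1]);   // string(0) ""
var_dump($without[2]);   // string(2) "42"
var_dump($withReset[1]); // string(2) "42"
```

With the branch reset, the caller always reads group 1, whichever alternative matched.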
For your problem, you can try to write something like this:
~^[^(]*\((?|[^);]*;(?: U;)? ([^;)]+)|([^)]+))~m
After that, all you need is to extract capture group 1.
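As a rough sketch of that extraction, assuming again that the user agents are stored one per line in $user_agent (four lines from the question are used; $m and $os are illustrative names):

```php
<?php
// Four of the user agents from the question, one per line.
$user_agent = <<<'UA'
Mozilla/5.0 (Windows NT 5.1) AppleWebKit/534.34 (KHTML, like Gecko) Dooble/1.40 Safari/534.34
Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; de-at) AppleWebKit/533.21.1 (KHTML, like Gecko) Version/5.0.5 Safari/533.21.1
Mozilla/5.0 (Windows; U; Windows NT 6.1; tr-TR) AppleWebKit/533.20.25 (KHTML, like Gecko) Version/5.0.4 Safari/533.20.27
Mozilla/5.0 (BlackBerry; U; BlackBerry 9700; en-US) AppleWebKit/534.8+ (KHTML, like Gecko) Version/6.0.0.380 Mobile Safari/534.8+
UA;

// Branch 1 handles "(brand; [U; ]os; ...)", branch 2 handles "(os)" with no ";".
// Both branches fill group 1 thanks to the branch reset (?|...).
preg_match_all('~^[^(]*\((?|[^);]*;(?: U;)? ([^;)]+)|([^)]+))~m', $user_agent, $m);

// With the default PREG_PATTERN_ORDER, $m[1] lists group 1 of every match.
$os = $m[1];

print_r($os);
```

For these lines, $os should hold Windows NT 5.1, Intel Mac OS X 10_6_8, Windows NT 6.1 and BlackBerry 9700, with no trimming needed because the space before the OS name is matched outside the group.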
Another way: using the \K feature
The \K removes everything that has been matched before it from the match result. So there is no need to define capture groups; the whole match can be the result:
~^[^(]*\((?:[^);]*;(?: U;)? \K[^;)]+|\K[^)]+)~m
But there is a lighter way: make the beginning of the first alternative optional and remove the second:
~^[^(]*\((?:[^);]*;(?: U;)? )?\K[^;)]+~m
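A short sketch comparing the two \K patterns, assuming the same one-per-line $user_agent input as in the question (four lines are used; $alt and $light are illustrative names). Because \K resets the start of the match, the results are read from index 0 rather than from a capture group.

```php
<?php
// Four of the user agents from the question, one per line.
$user_agent = <<<'UA'
Mozilla/5.0 (iPhone; U; CPU like Mac OS X) AppleWebKit/420.1 (KHTML, like Gecko) Version/3.0 Mobile/4A93 Safari/419.
Mozilla/5.0 (Linux; U; Android 2.2.1; zh-tw; HTC_Sensation_S710e Build/FRG83D) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1
Mozilla/5.0 (Windows; U; Windows NT 6.0; nl) AppleWebKit/522.13.1 (KHTML, like Gecko) Version/3.0.2 Safari/522.13.1
Mozilla/5.0 (Windows NT 5.1) AppleWebKit/534.34 (KHTML, like Gecko) Dooble/1.40 Safari/534.34
UA;

// Version with two alternatives, each containing its own \K.
preg_match_all('~^[^(]*\((?:[^);]*;(?: U;)? \K[^;)]+|\K[^)]+)~m', $user_agent, $alt);

// Lighter version: the "brand; [U; ]" prefix is simply optional.
preg_match_all('~^[^(]*\((?:[^);]*;(?: U;)? )?\K[^;)]+~m', $user_agent, $light);

// No capture groups are defined, so index 0 (the whole match) is the result.
print_r($light[0]);
var_dump($alt[0] === $light[0]);
```

On these four lines both patterns should return CPU like Mac OS X, Android 2.2.1, Windows NT 6.0 and Windows NT 5.1.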
Upvotes: 2