MOTIVECODEX

Reputation: 2752

Regex and / or / exclude / include with PHP

I am not sure which approach I need, but I am trying to extract the OS, the OS version, and brands such as iPhone or Macintosh from the following browser user agents:

Mozilla/5.0 (Windows NT 5.1) AppleWebKit/534.34 (KHTML, like Gecko) Dooble/1.40 Safari/534.34
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.75.14 (KHTML, like Gecko) Version/7.0.3 Safari/7046A194A
Mozilla/5.0 (iPhone; U; CPU like Mac OS X) AppleWebKit/420.1 (KHTML, like Gecko) Version/3.0 Mobile/4A93 Safari/419.
Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; de-at) AppleWebKit/533.21.1 (KHTML, like Gecko) Version/5.0.5 Safari/533.21.1
Mozilla/5.0 (Windows; U; Windows NT 6.1; tr-TR) AppleWebKit/533.20.25 (KHTML, like Gecko) Version/5.0.4 Safari/533.20.27
Mozilla/5.0 (Linux; U; Android 2.2.1; zh-tw; HTC_Sensation_S710e Build/FRG83D) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1
Mozilla/5.0 (Windows; U; Windows NT 6.0; nl) AppleWebKit/522.13.1 (KHTML, like Gecko) Version/3.0.2 Safari/522.13.1
Mozilla/5.0 (BlackBerry; U; BlackBerry 9700; en-US) AppleWebKit/534.8+ (KHTML, like Gecko) Version/6.0.0.380 Mobile Safari/534.8+

I do not know whether I need preg_match_all, preg_match, preg_replace, or preg_split. The strings do not all have the same format, and I tried the following regex:

preg_match_all('/\((.*?);|\((.*?)\) AppleWebKit/im', $user_agent, $brandmatch, PREG_PATTERN_ORDER);

Which has this result, which is good:

Macintosh
iPhone
Macintosh
Windows
Linux
Windows
BlackBerry
Windows NT 5.1

preg_match_all('/\(.*?; (.*?)\)/im', $user_agent, $brandmatch, PREG_PATTERN_ORDER);

This gives the following result (I want entries 1 to 6 to look like entry 0):

0   =>  Intel Mac OS X 10_9_3
1   =>  U; CPU like Mac OS X
2   =>  U; Intel Mac OS X 10_6_8; de-at
3   =>  U; Windows NT 6.1; tr-TR
4   =>  U; Android 2.2.1; zh-tw; HTC_Sensation_S710e Build/FRG83D
5   =>  U; Windows NT 6.0; nl
6   =>  U; BlackBerry 9700; en-US

So I tried the following:

preg_match_all('/U; (.*?);/im', $user_agent, $brandmatch, PREG_PATTERN_ORDER);

This gives the following result (two fewer entries than above, which is bad):

0   =>  Intel Mac OS X 10_6_8
1   =>  Windows NT 6.1
2   =>  Android 2.2.1
3   =>  Windows NT 6.0
4   =>  BlackBerry 9700

So what I am trying to do is: I want the OS + OS versions. I also tried:

\(.*?; (.*?)\)|U; (.*?);

Which has this result:

0   =>  Intel Mac OS X 10_9_3
1   =>  U; CPU like Mac OS X
2   =>  U; Intel Mac OS X 10_6_8; de-at
3   =>  U; Windows NT 6.1; tr-TR
4   =>  U; Android 2.2.1; zh-tw; HTC_Sensation_S710e Build/FRG83D
5   =>  U; Windows NT 6.0; nl
6   =>  U; BlackBerry 9700; en-US

So the results I need are:

0   =>  Intel Mac OS X 10_9_3
1   =>  CPU like Mac OS X
2   =>  Intel Mac OS X 10_6_8
3   =>  Windows NT 6.1
4   =>  Android 2.2.1
5   =>  Windows NT 6.0
6   =>  BlackBerry 9700

Upvotes: 0

Views: 46

Answers (2)

anubhava

Reputation: 785008

You can use this regex:

/^\S+ +\((?:[^;\n]*;)?(?: U; )?([^;)]+)/m

RegEx Demo

  • First it matches everything up to the first space.
  • Then it matches ( and, optionally, zero or more characters other than ; or a newline, followed by a ;.
  • Then it optionally matches U; (including the spaces around it).
  • Finally it captures everything up to the next ) or ; in capture group #1.
  • See the demo for more details.
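
As a rough sketch of how this regex could be used from PHP, the snippet below runs it over a few of the user agents from the question, joined with newlines so the m flag applies. The array_map('trim', ...) step is an addition that is not part of the answer: when the optional " U; " part is absent, the captured group can start with a space.

```php
<?php
// A subset of the user agents from the question, joined into one multi-line string.
$user_agent = implode("\n", [
    'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/534.34 (KHTML, like Gecko) Dooble/1.40 Safari/534.34',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.75.14 (KHTML, like Gecko) Version/7.0.3 Safari/7046A194A',
    'Mozilla/5.0 (iPhone; U; CPU like Mac OS X) AppleWebKit/420.1 (KHTML, like Gecko) Version/3.0 Mobile/4A93 Safari/419.',
    'Mozilla/5.0 (Linux; U; Android 2.2.1; zh-tw; HTC_Sensation_S710e Build/FRG83D) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1',
]);

// The regex from the answer; group 1 holds the OS and its version.
preg_match_all('/^\S+ +\((?:[^;\n]*;)?(?: U; )?([^;)]+)/m', $user_agent, $matches);

// Assumption: trim each capture, since it may carry a leading space.
$os = array_map('trim', $matches[1]);

print_r($os);
```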

Upvotes: 2

Casimir et Hippolyte

Reputation: 89557

The branch reset feature may interest you: it allows several alternatives, and each alternative shares the same capture group numbers with the others.

A branch reset is like this:

(?|alternat(ive1)|alternati(ve2)|alternat(ive3)|e(tc.))

You can see four pairs of capturing parentheses, but in this construct they all share the same group number (so only one capture group is defined, and its content depends on the branch that succeeds).
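
To make the difference concrete, here is a small sketch that compares a plain non-capturing group with a branch reset group, using two of the placeholder alternatives from above. The variable names $plain and $reset are just for illustration.

```php
<?php
// Plain alternation: each pair of parentheses gets its own number,
// so the second alternative fills group 2 and group 1 stays empty.
preg_match('~(?:alternat(ive1)|alternati(ve2))~', 'alternative2', $plain);

// Branch reset: both alternatives write to group 1.
preg_match('~(?|alternat(ive1)|alternati(ve2))~', 'alternative2', $reset);

print_r($plain);
print_r($reset);
```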

For your problem, you can try to write something like this:

~^[^(]*\((?|[^);]*;(?: U;)? ([^;)]+)|([^)]+))~m

demo

All you need to do afterwards is extract capture group 1.
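
A minimal sketch of that extraction with preg_match_all, assuming the user agents are held in one newline-separated string as in the question. Only a subset of the strings is used here to keep it short.

```php
<?php
// A subset of the user agents from the question, one per line.
$user_agent = implode("\n", [
    'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/534.34 (KHTML, like Gecko) Dooble/1.40 Safari/534.34',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.75.14 (KHTML, like Gecko) Version/7.0.3 Safari/7046A194A',
    'Mozilla/5.0 (Linux; U; Android 2.2.1; zh-tw; HTC_Sensation_S710e Build/FRG83D) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1',
    'Mozilla/5.0 (BlackBerry; U; BlackBerry 9700; en-US) AppleWebKit/534.8+ (KHTML, like Gecko) Version/6.0.0.380 Mobile Safari/534.8+',
]);

// Branch reset pattern from the answer: whichever branch succeeds,
// the OS and version end up in group 1.
preg_match_all(
    '~^[^(]*\((?|[^);]*;(?: U;)? ([^;)]+)|([^)]+))~m',
    $user_agent,
    $brandmatch,
    PREG_PATTERN_ORDER
);

print_r($brandmatch[1]);
```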


Another way: using the \K feature

\K removes everything matched before it from the match result, so there is no need to define capture groups; the whole match can be the result:

~^[^(]*\((?:[^);]*;(?: U;)? \K[^;)]+|\K[^)]+)~m

demo
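
The effect of \K is easiest to see on a toy string first. The sketch below shows that, and then applies the pattern above to one user agent from the question; the "foobar" example is made up for illustration.

```php
<?php
// "foo" must be present, but \K drops it from the reported match.
preg_match('~foo\Kbar~', 'foobar', $m);
echo $m[0], "\n"; // the whole match is now just "bar"

// Same idea with the pattern from the answer: the result is in index 0,
// no capture group needed.
$ua = 'Mozilla/5.0 (BlackBerry; U; BlackBerry 9700; en-US) AppleWebKit/534.8+ (KHTML, like Gecko) Version/6.0.0.380 Mobile Safari/534.8+';
preg_match('~^[^(]*\((?:[^);]*;(?: U;)? \K[^;)]+|\K[^)]+)~m', $ua, $os);
echo $os[0], "\n";
```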


But there is a lighter way: make the beginning of the first alternative optional and remove the second:

~^[^(]*\((?:[^);]*;(?: U;)? )?\K[^;)]+~m

demo
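
If the user agents arrive one at a time rather than as one multi-line string, this lighter pattern could be wrapped in a small function. The sketch below assumes that setup; extract_os is a hypothetical helper name, not something from the answer, and returning null for a non-matching string is a design choice made here.

```php
<?php
// Hypothetical helper: returns the OS token of a single user agent string,
// or null when the pattern does not match.
function extract_os(string $ua): ?string
{
    // Lighter \K pattern: the "brand; U; " prefix is optional.
    $pattern = '~^[^(]*\((?:[^);]*;(?: U;)? )?\K[^;)]+~m';

    return preg_match($pattern, $ua, $m) === 1 ? $m[0] : null;
}

var_dump(extract_os('Mozilla/5.0 (Windows NT 5.1) AppleWebKit/534.34 (KHTML, like Gecko) Dooble/1.40 Safari/534.34'));
var_dump(extract_os('Mozilla/5.0 (Windows; U; Windows NT 6.0; nl) AppleWebKit/522.13.1 (KHTML, like Gecko) Version/3.0.2 Safari/522.13.1'));
var_dump(extract_os('no parentheses here'));
```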

Upvotes: 2
