user2270911
user2270911

Reputation: 305

SAS Perl Regular Expressions: How to write correct syntax?

I have some complicated string parsing which would be very difficult to accomplish using regular SAS functions because of the string value inconsistency; as a result I think I will need to use Perl Regular Expressions. Below have 4 variables (price, date, size, bundle) which I have to create using parts of the text string. I'm have trouble getting the syntax correct - I am new to regular expressions.

Here is a sample data set.

data have;
infile cards truncover;
  input text $80.;
cards;
acq_newsale_0_CartChat_0_Flash_1192014.jpg
acq_old_3x_GadgetPotomac_7999_Flash_112014.swf
acq_sale_3xconoffer_8999_nacpg_2102014.sfw
acq_is_3X_ItsEasy_8999_NACPG_Flash_272014_728x90.hgp
awa_os_3xMZ1_FiOSPresents_FF_160x600_12252014.mov
awa_fs_0_TWCMLP_v2_switch_0_0_Static_462014_300x250.jpg
acq_fi_2x_incrediblemz1_7999_nac_flash_1192014_160x600.swf
acq_fio_3x_bringhome_6499_0_flash_12162013_728x90.swf
;run;

/The first variable is price it is normally located near the end or middle of the string/

data want;
set have;
  price =(input(prxchange('s/(\w+)_(\d+)_(\w+)/$2/',-1,text),8.))/100;
  format price dollar8.2;
run;

Using the data set above I need to have this result:

price 0 79.99 89.99 89.99 79.99 64.99

/Date is always a series of consecutive digits. Either 6, 7 or 8. Using | which means 'or' I thought I would be able to pull that way/

data want;
set have;
date=prxparse('/\d\d\d\d\d\d|\d\d\d\d\d\d\d|\d\d\d\d\d\d\d\d/',text);
run;

Using the data set above I need to have this result:

Date 1192014 112014 2102014 272014 12252014 462014 1192014 12162013

/* For size there is always an ‘x’ in the middle of the sub-string which is with followed by two or three digits on either side*/

data want;
set have;
  size=prxparse('/(\w+)_(\d+)'x'(\d+)_(\w+)/',text);
run;

Size 728x90 160x600 300x250 160x600 728x90

/*This is normally located towards the beginning of the string. It’s always a single digit number followed by an x It in never followed by additional digits but can also be just 0. */

data want;
set have;
  Bundle=prxparse('/(\d+)'x'',text);
run;

Bundle 0 3x 3x 3X 3x 0 2x 3x

The final product I am looking for should look like this:

Text    Date    price   Size    Bundle
acq_newsale_0_CartChat_0_Flash_1192014.jpg  1192014 0       0
acq_old_3x_GadgetPotomac_7999_Flash_112014.swf  112014  79.99       3x
acq_sale_3xconoffer_8999_nacpg_2102014.sfw  2102014 89.99       3x
acq_is_3X_ItsEasy_8999_NACPG_Flash_272014_728x90.hgp    272014  89.99   728x90  3X
awa_os_3xMZ1_FiOSPresents_FF_160x600_12252014.mov   12252014        160x600 3x
awa_fs_0_TWCMLP_v2_switch_0_0_Static_462014_300x250.jpg 462014      300x250 0
acq_fi_2x_incrediblemz1_7999_nac_flash_1192014_160x600.swf  1192014 79.99   160x600 2x
acq_fio_3x_bringhome_6499_0_flash_12162013_728x90.swf   12162013    64.99   728x90  3

x

Upvotes: 0

Views: 734

Answers (1)

Joe
Joe

Reputation: 63434

If you're extracting, don't use PRXCHANGE. Use PRXPARSE, PRXMATCH, and PRXPOSN.

Sample usage, with date:

data have;
infile cards truncover;
  input text $80.;
cards;
acq_newsale_0_CartChat_0_Flash_1192014.jpg
acq_old_3x_GadgetPotomac_7999_Flash_112014.swf
acq_sale_3xconoffer_8999_nacpg_2102014.sfw
acq_is_3X_ItsEasy_8999_NACPG_Flash_272014_728x90.hgp
awa_os_3xMZ1_FiOSPresents_FF_160x600_12252014.mov
awa_fs_0_TWCMLP_v2_switch_0_0_Static_462014_300x250.jpg
acq_fi_2x_incrediblemz1_7999_nac_flash_1192014_160x600.swf
acq_fio_3x_bringhome_6499_0_flash_12162013_728x90.swf
;
run;

data want;
set have;
rx_date = prxparse('~(\d{6,8})~io');
rc_date = prxmatch(rx_date,text);
if rc_date then datevar = prxposn(rx_date,1,text);
run;

Just enclose in parens the section you want to extract (in this case, all of it).

Date was easy - as you say, 6-8 numbers. The others may be harder. The 3x etc. bit you can probably find, depending on how strict you need to be; the price I think you'll have a very hard time finding. You need to be able to better articulate the rules. "Towards the beginning" isn't a regex rule. "The second set of digits" is; "The second to last set", perhaps might work. I'll see if I can figure out a few.

In your example data, this works. I in particular don't like the price search; that one may well fail with a more complicated set of data. You can figure out adding the decimal for yourself.

data have;
infile cards truncover;
  input text $80.;
cards;
acq_newsale_0_CartChat_0_Flash_1192014.jpg
acq_old_3x_GadgetPotomac_7999_Flash_112014.swf
acq_sale_3xconoffer_8999_nacpg_2102014.sfw
acq_is_3X_ItsEasy_8999_NACPG_Flash_272014_728x90.hgp
awa_os_3xMZ1_FiOSPresents_FF_160x600_12252014.mov
awa_fs_0_TWCMLP_v2_switch_0_0_Static_462014_300x250.jpg
acq_fi_2x_incrediblemz1_7999_nac_flash_1192014_160x600.swf
acq_fio_3x_bringhome_6499_0_flash_12162013_728x90.swf
blahblah :23 blahblah
blahblahblah 23 blah blah
;
run;

data want;
set have;
rx_date   = prxparse('~_(\d{6,8})[_\.]~io');
rx_price  = prxparse('~_(\d+)_.*?(?=_\d+[_\.]).*?(?!_\d+[_\.])~io');
rx_bundle = prxparse('~(?!_\d+_)_(\dx)~io');
rx_size   = prxparse('~_(\d+x\d+)[_\.]~io');
rx_adnum  = prxparse('~\s:?(\d\d)\s~io');

rc_date   = prxmatch(rx_date,text);
rc_price  = prxmatch(rx_price,text);
rc_bundle = prxmatch(rx_bundle,text);
rc_size   = prxmatch(rx_size,text);
rc_adnum  = prxmatch(rx_adnum,text);

if rc_date   then datevar = prxposn(rx_date,1,text);
if rc_price  then price = prxposn(rx_price,1,text);
if rc_bundle then bundle = prxposn(rx_bundle,1,text);
if rc_size   then size   = prxposn(rx_size,1,text);
if rc_adnum  then adnum  = prxposn(rx_adnum,1,text);

run;

Upvotes: 3

Related Questions