knalpiap

Reputation: 67

PHP Regexp on filename

I have a collection of files with a certain structure:

COMPANY_DE-Actual-Contents-of-File-RGB-ENG.pdf

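Breakdown:

- COMPANY -> company name
- DE -> location
- Actual-Contents-of-File -> content
- RGB -> color mode
- ENG -> language
- pdf -> file type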

Ideally my result would be an array holding the above info under named keys, but I wouldn't know where to start.

Help would be greatly appreciated!

Thanks, Knal


Sorry to have been so unclear, but a few variables are not always present in the filename:

- DE -> fixed options: '_DE', '_BE', or absent
- RGB -> color mode, fixed options: 'RGB', 'CMYK', 'PMS', or absent
- ENG -> language of the file, fixed options: 'GER', 'ENG', or absent

Upvotes: 0

Views: 185

Answers (5)

knalpiap

Reputation: 67

Inspired by @Armatus, I've constructed the following, which appears to be fail-safe:

$string = "COMPANY_DE-Actual-Contents+of-File-RGB-ENG.pdf";
$options_location = array('DE','BE');
$options_color = array('RGB','CMYK','PMS');
$options_language = array('ENG','GER');
// Split on '.', '-' and '_'; a limit of -1 means "no limit"
$parts = preg_split( '/[.\-_]/', $string, -1, PREG_SPLIT_NO_EMPTY );

$data = array();
$data['company'] = array_shift($parts); // first part is always the company
$data['filetype'] = array_pop($parts); // last part is always the file type

// Location, if present, directly follows the company
if( in_array( reset($parts), $options_location ) ){
    $data['location'] = array_shift($parts);
}else{
    $data['location'] = NULL;
}

// Language, if present, is the last remaining part
if( in_array( end($parts), $options_language ) ){
    $data['language'] = array_pop($parts);
}else{
    $data['language'] = NULL;
}

// Color mode, if present, comes right before the language
if( in_array( end($parts), $options_color ) ){
    $data['colormode'] = array_pop($parts);
}else{
    $data['colormode'] = NULL;
}

// Whatever is left is the content
$data['content'] = implode( ' ', $parts );
print_r( $data );

Upvotes: 0

Toto

Reputation: 91518

How about:

$files = array(
    'COMPANY_DE-Actual-Contents-of-File-RGB-ENG.pdf',
    'COMPANY_BE-Actual-Contents-of-File-CMYK-ENG.pdf',
    'COMPANY_DE-Actual-Contents-of-File-PMS-GER.doc',
    'COMPANY-Actual-Contents-of-File-PMS-GER.doc',
    'COMPANY-Actual-Contents-of-File-GER.doc',
    'COMPANY-Actual-Contents-of-File.doc',
);

foreach($files as $file) {
    preg_match('/^(?<COMPANY>.*?)_?(?<LOCATION>DE|BE)?-(?<CONTENT>.*?)-?(?<COLOR>RGB|CMYK|PMS)?-?(?<LANG>ENG|GER)?\.(?<EXT>[^.]+)$/', $file, $m);
    echo "\nfile=$file\n";
    echo "COMPANY: ",$m['COMPANY'],"\n";
    echo "LOCATION: ",$m['LOCATION'],"\n";
    echo "CONTENT: ",$m['CONTENT'],"\n";
    echo "COLOR: ",$m['COLOR'],"\n";
    echo "LANG: ",$m['LANG'],"\n";
    echo "EXT: ",$m['EXT'],"\n";
}

Output:

file=COMPANY_DE-Actual-Contents-of-File-RGB-ENG.pdf
COMPANY: COMPANY
LOCATION: DE
CONTENT: Actual-Contents-of-File
COLOR: RGB
LANG: ENG
EXT: pdf

file=COMPANY_BE-Actual-Contents-of-File-CMYK-ENG.pdf
COMPANY: COMPANY
LOCATION: BE
CONTENT: Actual-Contents-of-File
COLOR: CMYK
LANG: ENG
EXT: pdf

file=COMPANY_DE-Actual-Contents-of-File-PMS-GER.doc
COMPANY: COMPANY
LOCATION: DE
CONTENT: Actual-Contents-of-File
COLOR: PMS
LANG: GER
EXT: doc

file=COMPANY-Actual-Contents-of-File-PMS-GER.doc
COMPANY: COMPANY
LOCATION:
CONTENT: Actual-Contents-of-File
COLOR: PMS
LANG: GER
EXT: doc

file=COMPANY-Actual-Contents-of-File-GER.doc
COMPANY: COMPANY
LOCATION:
CONTENT: Actual-Contents-of-File
COLOR:
LANG: GER
EXT: doc

file=COMPANY-Actual-Contents-of-File.doc
COMPANY: COMPANY
LOCATION:
CONTENT: Actual-Contents-of-File
COLOR:
LANG:
EXT: doc
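Since the question asks for an array with named keys, the pattern above could be wrapped in a small function. This is only a sketch: the function name parseWithRegex, the lowercase key names, and the choice to return NULL for absent parts are my own assumptions, not part of the answer.

```php
<?php
// Hypothetical helper (name and return shape are assumptions) built around
// the pattern from the answer above.
function parseWithRegex(string $file): ?array
{
    $pattern = '/^(?<COMPANY>.*?)_?(?<LOCATION>DE|BE)?-(?<CONTENT>.*?)'
             . '-?(?<COLOR>RGB|CMYK|PMS)?-?(?<LANG>ENG|GER)?\.(?<EXT>[^.]+)$/';

    if (!preg_match($pattern, $file, $m)) {
        return null; // filename does not follow the expected structure
    }

    $result = array();
    foreach (array('COMPANY', 'LOCATION', 'CONTENT', 'COLOR', 'LANG', 'EXT') as $key) {
        // Optional groups that did not take part in the match come back as ''
        $result[strtolower($key)] = ($m[$key] ?? '') !== '' ? $m[$key] : null;
    }
    return $result;
}

print_r(parseWithRegex('COMPANY-Actual-Contents-of-File-GER.doc'));
```

Returning NULL for a missing part makes it easy to tell "absent" from "empty" later on, for example with isset($result['color']).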

Upvotes: 0

Armatus

Reputation: 2191

Try not to use regular expressions if possible, or keep them as simple as it gets.

$text = "COMPANY_DE-Actual-Contents-of-File-RGB-ENG.pdf";
$options_location = array('DE','BE');
$options_color = array('RGB','CMYK','PMS');
$options_language = array('ENG','GER');

// Does it have multiple such lines? Split them; a single filename just gives one line.
$lines = explode("\n", $text);

// Then loop over the lines, doing the following for each one:
foreach ($lines as $line) {
    $parts = preg_split('/[-_.]/', $line);
    $data = array(); // result array for this line
    $data['company'] = array_shift($parts); // the first element is always the company
    $data['filetype'] = array_pop($parts); // the last element is always the file type
    foreach ($parts as $part) { // test each of the remaining parts for what it is
        if (in_array($part, $options_location))
            $data['location'] = $part;
        elseif (in_array($part, $options_color))
            $data['color'] = $part;
        elseif (in_array($part, $options_language))
            $data['lang'] = $part;
        else // wasn't any of the others, so attach it to the content
            $data['content'] = isset($data['content']) ? $data['content'].' '.$part : $part;
    }
    print_r($data);
}

This is easier to understand as well, instead of having to figure out what exactly a regex is doing.

Note that this assumes that no part of the content can be one of the words which are reserved for location, color or language. If it is possible for these to occur within the contents, you will have to add conditions like isset($data['location']) to check if there was already another location found and if so add the correct one to the content instead of storing it as the location.
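The isset() guard mentioned above might look roughly like this. It is a minimal sketch, assuming the first occurrence of a reserved word is the real field and any later occurrence belongs to the content; the function name parseFilenameGuarded and the sample filename are made up for illustration.

```php
<?php
// Hypothetical helper (not from the answer): same splitting approach,
// but a reserved word is only taken as a field the first time it appears.
function parseFilenameGuarded(string $line): array
{
    $options_location = array('DE', 'BE');
    $options_color    = array('RGB', 'CMYK', 'PMS');
    $options_language = array('ENG', 'GER');

    $parts = preg_split('/[-_.]/', $line);
    $data = array('location' => null, 'color' => null, 'lang' => null, 'content' => null);
    $data['company']  = array_shift($parts); // first part is always the company
    $data['filetype'] = array_pop($parts);   // last part is always the file type

    foreach ($parts as $part) {
        if (!isset($data['location']) && in_array($part, $options_location)) {
            $data['location'] = $part;
        } elseif (!isset($data['color']) && in_array($part, $options_color)) {
            $data['color'] = $part;
        } elseif (!isset($data['lang']) && in_array($part, $options_language)) {
            $data['lang'] = $part;
        } else {
            // Not a field, or that field was already found: treat it as content
            $data['content'] = isset($data['content']) ? $data['content'].' '.$part : $part;
        }
    }
    return $data;
}

print_r(parseFilenameGuarded('COMPANY_DE-Made-in-BE-RGB-ENG.pdf'));
```

Here the second location code, BE, ends up in the content ("Made in BE") while DE stays the location. The first-occurrence assumption suits the location, which sits at the front of the filename. For color and language, which sit at the end, a reserved word inside the content would be seen first, so checking by position may be the safer choice there.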

Upvotes: 1

s.webbandit

Reputation: 17028

Something like this:

preg_match('#^([^_]+)(_[^-]+)?-([\w-]+)-(\w+)-(\w+)(\.\w+)$#i', 'COMPANY_DE-Actual-Contents-of-File-RGB-ENG.pdf', $m);

preg_match('#^([^_]+)(_[^-]+)?-([\w-]+)-(\w+)[_-]([^_]+)(\.\w+)$#i', 'COMPANY_DE-Actual-Contents-of-File-RGB-ENG.pdf', $m); // for both '_' and '-'

preg_match('#^(\p{Lu}+)(-\p{Lu}+)?-([\w]+)(\-(\p{Lu}+))?(\-(\p{Lu}+))?(\.\w+)$#', 'COMPANY-NL-Actual_Contents_of_File-RGB-ENG.pdf', $m); // if filename parts divider is strictly '-'

var_dump($m);

In the last variant, as you were asking, if there is no country code (-NL) it will be NULL. With the color and language codes that is not the case. Try it yourself and you'll figure out how it works!

Upvotes: 0

Jack

Reputation: 5768

Try

$string = "COMPANY_DE-Actual-Contents-of-File-RGB-ENG.pdf";
$array = preg_split('/[-_\.]/', $string);

$len = count($array);
$struct = array($array[0], $array[1], '', $array[$len-3], $array[$len-2], $array[$len-1]);
unset($array[0], $array[1], $array[$len-3], $array[$len-2], $array[$len-1]);
$struct[2] = implode('-', $array);
var_dump($struct);

Output:

array
  0 => string 'COMPANY' (length=7)
  1 => string 'DE' (length=2)
  2 => string 'Actual-Contents-of-File' (length=23)
  3 => string 'RGB' (length=3)
  4 => string 'ENG' (length=3)
  5 => string 'pdf' (length=3)

Upvotes: 1
