user3622769

Reputation: 13

Parse the lines of a predictably formatted text file

I am trying to extract some formatted info from files.

Sample data

2011/09/20  00:57       367,044,608 S1E04 - Cancer Man.avi
2012/03/12  03:01       366,991,496 Family Guy - S09E01 - And Then There Were Fewer.avi
2012/03/25  00:27        53,560,510 Avatar- The Legend of Korra S01E01.avi

What I would like to extract is the date, file size, and name of each file, bearing in mind that the file name can start with basically anything and the file size varies from line to line.

What I have currently.

$dateModifyed = substr($file, 0, 10); 
$fileSize = preg_match('[0-9]*/[0-9]*/[0-9]*/s[0-9]*:[0-9]*/s*', $file, $match)
$FileName = 

The full code that I am working on:

function recursivePrint($folder, $subFolders, $Jsoncounter) {
    $f = fopen("file.json", "a");
    
    echo '{ "id" : "' . $GLOBALS['Jsoncounter'] . '", parent" : "' . "#" . '", Text" : "' . $folder . '" },' . "\n";
    $PrintString = '{ "id" : "' . $GLOBALS['Jsoncounter'] . '", parent" : "' . "#" . '", Text" : "' . $folder . '" },' . "\n";
    fwrite($f, $PrintString);
    $foldercount = $GLOBALS['Jsoncounter'];
    $GLOBALS['Jsoncounter']++;
    foreach($subFolders->files as $file) {


        preg_match('/^(\d{4}/\d{2}/\d{2}\s+\d{2}:\d{2})\s+([\d,]+)\s+(.*)$/', $file, $match);
        $dateModified = $match[1];
        $fileSize = str_replace(',', '', $match[2]);
        $fileName = $match[3];
        echo $dateModified . $fileSize . $fileName;


        echo '{ "id" : "' . $GLOBALS['Jsoncounter'] . '", parent" : "' . $foldercount . '", Text" : "' . $file . '" },';
        $PrintString ='{ "id" : "' . $GLOBALS['Jsoncounter'] . '", parent" : "' . $foldercount . '", Text" : "' . $file . '" },';
        fwrite($f, $PrintString);
        $GLOBALS['Jsoncounter']++;
    }
    
    foreach($subFolders->folders as $folder => $subSubFolders) {
        recursivePrint($folder, $subSubFolders, $Jsoncounter);
    }
    fclose($f); 
}

Upvotes: 1

Views: 326

Answers (3)

mickmackusa

Reputation: 47894

While preg_match() is certainly a viable technique, and preg_match_all() can parse the whole file in one go, you should also consider the seldom-used fscanf() function, which is designed to parse lines of predictably formatted text directly from a file handle. Unlike preg_match() and preg_match_all(), it can return just the desired values without any unneeded elements (such as the full string match).

$result = [];
if ($handle = fopen($file, 'r')) {
    // %*s reads the time column and discards it; %[^\n] takes the rest of the line
    while (fscanf($handle, "%s %*s %s %[^\n]", $date, $size, $title)) {
        $result[] = [
            'date' => $date,
            'size' => (int) str_replace(',', '', $size),
            'title' => $title
        ];
    }
    fclose($handle);
}
echo json_encode($result);  // print fully-formed, valid JSON string

It is also worth reminding everyone to avoid the temptation to build JSON strings by hand: doing so risks generating invalid JSON, which can be a headache to repair.

Notice how you have:

echo '{ "id" : "' . $GLOBALS['Jsoncounter'] . '", parent" : "' . $foldercount . '", Text" : "' . $file . '" },';
// whoops ---------------------------------------^---------------------------------^
// your manually written json is missing leading double quotes on two keys
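
As a rough sketch of that advice, the recursive function from the question could collect plain arrays and encode them once at the end. The helper name collectNodes() and the sample tree are hypothetical; the input shape (objects with files and folders properties) and the id / parent / Text keys are assumptions taken from the question's code.

```php
<?php
// Hypothetical helper: flattens a folder tree into a list of node arrays.
// Assumes each folder is an object with ->files (array of strings) and
// ->folders (array of name => sub-folder object), as in the question.
function collectNodes(string $name, object $folder, string $parent, array &$nodes): void
{
    $id = (string) count($nodes);
    $nodes[] = ['id' => $id, 'parent' => $parent, 'Text' => $name];

    foreach ($folder->files as $file) {
        $nodes[] = ['id' => (string) count($nodes), 'parent' => $id, 'Text' => $file];
    }
    foreach ($folder->folders as $subName => $subFolder) {
        collectNodes($subName, $subFolder, $id, $nodes);
    }
}

// Made-up sample tree, for illustration only.
$tree = (object) [
    'files'   => ['S1E04 - Cancer Man.avi'],
    'folders' => [
        'Cartoons' => (object) [
            'files'   => ['Avatar- The Legend of Korra S01E01.avi'],
            'folders' => [],
        ],
    ],
];

$nodes = [];
collectNodes('Videos', $tree, '#', $nodes);

// json_encode() takes care of all quoting and escaping.
$json = json_encode($nodes, JSON_PRETTY_PRINT);
echo $json, "\n";
```

Because the whole list is encoded in one call, there are no trailing commas or missing quotes to chase, and the result can be written to the file with a single file_put_contents() instead of repeated appends.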

Upvotes: 0

Barmar

Reputation: 780994

You need to use capture groups to get the parts of the string that are matched by different parts of the regular expression. Capture groups use parentheses around portions of the regexp.

preg_match('#^(\d{4}/\d{2}/\d{2}\s+\d{2}:\d{2})\s+([\d,]+)\s+(.*)$#', $string, $match);
$dateModified = $match[1];
$fileSize = str_replace(',', '', $match[2]);
$fileName = $match[3];

Other problems in your regexp:

  • You left out the delimiters at the beginning and end.
  • You used /s instead of \s for whitespace characters.
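
Putting those points together, here is a minimal sketch that wraps the pattern in a function and checks the return value of preg_match() before reading the groups. The function name parseListingLine() and the returned array keys are hypothetical. Using # as the delimiter means the slashes in the date need no escaping.

```php
<?php
// Hypothetical helper: returns the parsed fields, or null if the line does not match.
function parseListingLine(string $line): ?array
{
    // '#' delimiters, \s for whitespace, and three capture groups.
    $pattern = '#^(\d{4}/\d{2}/\d{2}\s+\d{2}:\d{2})\s+([\d,]+)\s+(.*)$#';
    if (preg_match($pattern, $line, $match) !== 1) {
        return null; // no match, or a regex error
    }
    return [
        'date' => $match[1],
        'size' => (int) str_replace(',', '', $match[2]),
        'name' => $match[3],
    ];
}

$line = '2011/09/20  00:57       367,044,608 S1E04 - Cancer Man.avi';
var_dump(parseListingLine($line));
var_dump(parseListingLine('not a listing line'));
```

Returning null for non-matching lines lets the caller skip headers or blank lines without triggering undefined-index notices on $match.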

There's a tutorial on regular expressions at www.regular-expressions.info.

Upvotes: 1

Tim Pietzcker

Reputation: 336158

There are several problems in your regex:

preg_match('[0-9]*/[0-9]*/[0-9]*/s[0-9]*:[0-9]*/s*', $file, $match)
            ^--missing delimiter ^            ^-- asterisk instead of plus
                                 |--literal s instead of \s

and of course you haven't used anchors or capturing groups, and the regex isn't finished yet.

Try the following:

preg_match_all(
    '%^                     # Start of line
    ([0-9]+/[0-9]+/[0-9]+)  # Date (group 1)
    \s+                     # Whitespace
    ([0-9]+:[0-9]+)         # Time (group 2)
    \s+                     # Whitespace
    ([0-9,]+)               # File size (group 3)
    \s+                     # Whitespace
    (.*)                    # Rest of the line%mx', 
    $file, $result, PREG_SET_ORDER);
for ($matchi = 0; $matchi < count($result); $matchi++) {
    for ($backrefi = 0; $backrefi < count($result[$matchi]); $backrefi++) {
        # Matched text = $result[$matchi][$backrefi];
    }
}

so for example $result[0][1] will contain 2011/09/20, and $result[2][4] will contain Avatar- The Legend of Korra S01E01.avi etc.
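
To make the indexing concrete, here is a small sketch that runs a compact, comment-free version of the same pattern over the sample data from the question and copies each match into a named array. The variable names $listing and $rows and the array keys are hypothetical.

```php
<?php
// Sample listing from the question, held as one multi-line string.
$listing = <<<'TXT'
2011/09/20  00:57       367,044,608 S1E04 - Cancer Man.avi
2012/03/12  03:01       366,991,496 Family Guy - S09E01 - And Then There Were Fewer.avi
2012/03/25  00:27        53,560,510 Avatar- The Legend of Korra S01E01.avi
TXT;

// Same groups as above: 1 = date, 2 = time, 3 = size, 4 = rest of the line.
// The m flag makes ^ match at the start of every line.
$pattern = '%^([0-9]+/[0-9]+/[0-9]+)\s+([0-9]+:[0-9]+)\s+([0-9,]+)\s+(.*)%m';
preg_match_all($pattern, $listing, $result, PREG_SET_ORDER);

$rows = [];
foreach ($result as $m) {
    $rows[] = [
        'date' => $m[1],
        'time' => $m[2],
        'size' => (int) str_replace(',', '', $m[3]),
        'name' => $m[4],
    ];
}
print_r($rows);
```

With PREG_SET_ORDER, each element of $result is one matched line, so a single foreach is enough to walk the file.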

Upvotes: 1
