Andy F

Reputation: 235

Reading WebVTT files in PHP

Does anyone have experience with reading WebVTT (.vtt) files using PHP?

I'm developing an application in CakePHP where I need to read through a bunch of vtt files and get the start time and associated text.

Here is an example of such a file:

00:00.999 --> 00:04.999
sentence one

00:04.999 --> 00:07.999
sentence two

00:07.999 --> 00:10.999
third sentence
with a line break

00:10.999 --> 00:14.999
a fourth sentence
on three
lines

I need to be able to extract something like this:

00:00.999 sentence one
00:04.999 sentence two
00:07.999 third sentence with a line break
00:10.999 a fourth sentence on three lines

Note that the caption text can contain line breaks, so there's no set number of lines between timestamps.

My plan was to search for "-->", which appears in every timestamp line. Does anyone have any ideas on how best to achieve this?

Upvotes: 1

Views: 4411

Answers (3)

Mantas D

Reputation: 4150

To parse the file, you can use a library like this:

$subtitles = Subtitles::loadFromFile('subtitles.vtt');
$blocks = $subtitles->getInternalFormat(); // array

foreach ($blocks as $block) {
    echo $block['start'];
    echo ' ';
    foreach ($block['lines'] as $line) {
        echo $line . ' ';
    }
    echo "\n";
} 

It will also extract the text from files that contain styles or small errors.

https://github.com/mantas-done/subtitles

Upvotes: 2

Andy F

Reputation: 235

This seems to achieve what I need, i.e. it outputs the start time and any subsequent lines of text. The files I'm using are fairly small, so using PHP's file() function to read everything into an array seems OK; I'm not sure it would work well on large files, though.

    $file = 'test.vtt';
    // Read the whole file into an array, dropping newlines and blank lines
    $file_as_array = file($file, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);

    foreach ($file_as_array as $f) {

        // Timestamp line, e.g. "00:00.999 --> 00:04.999": print the start time
        if (preg_match("/^(\d{2}:[\d\.]+) --> \d{2}:[\d\.]+$/", $f, $match)) {
            echo '<br>';
            echo $match[1];
            continue;
        }

        // Any other line is caption text, except the file header, which includes the word 'WEBVTT'.
        // Compare with === false: strpos() returns 0 when the match is at the start of the line.
        if (strpos($f, 'WEBVTT') === false) {
            echo ' ' . $f . ' ';
        }

    }
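
On the large-file concern: a minimal sketch of the same idea that reads one line at a time with fgets() instead of loading the whole file with file(). The function name read_vtt_cues is made up for illustration, and it assumes the same simple mm:ss.ttt timestamps as the example file above.

```php
<?php
// Hypothetical helper (name is illustrative): yields [start time, caption text]
// pairs one cue at a time, so only the current cue is held in memory.
function read_vtt_cues(string $path): Generator
{
    $handle = fopen($path, 'r');
    if ($handle === false) {
        throw new RuntimeException("Cannot open $path");
    }

    $start = null; // start time of the cue being collected
    $text  = [];   // caption lines collected for that cue

    try {
        while (($line = fgets($handle)) !== false) {
            $line = trim($line);
            if (preg_match('/^(\d{2}:[\d.]+) --> \d{2}:[\d.]+$/', $line, $match)) {
                // A new timestamp line closes the previous cue, if there was one
                if ($start !== null) {
                    yield [$start, implode(' ', $text)];
                }
                $start = $match[1];
                $text  = [];
            } elseif ($start !== null && $line !== '') {
                // Non-empty text after a timestamp line belongs to the current cue;
                // anything before the first timestamp (the WEBVTT header) is skipped
                $text[] = $line;
            }
        }
        // Emit the last cue in the file
        if ($start !== null) {
            yield [$start, implode(' ', $text)];
        }
    } finally {
        fclose($handle);
    }
}
```

It could be used as `foreach (read_vtt_cues('test.vtt') as [$start, $text]) { echo $start . ' ' . $text . "\n"; }`, which prints one line per cue in the format asked for in the question.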

Upvotes: 1

kums

Reputation: 2691

You can do something like this:

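<?php

function send_reformatted($vtt_file){
    // Add these headers to ease saving the output as a text file
    header("Content-type: text/plain");
    header('Content-Disposition: inline; filename="'.$vtt_file.'.txt"');

    $f = fopen($vtt_file, "r");
    if ($f === false) return; // the file could not be opened
    $line_new = "";

    // Compare with false explicitly so that a line such as "0" does not end the loop
    while(($line = fgets($f)) !== false){
        $line = trim($line); // also strips "\r" from Windows line endings
        if (preg_match("/^(\d{2}:[\d\.]+) --> \d{2}:[\d\.]+$/", $line, $match)) {
            // New timestamp line: print the previous cue and start a new one
            if($line_new !== "") echo $line_new."\n";
            $line_new = $match[1];
        } elseif($line !== "" && $line_new !== ""){
            // Caption text; lines before the first timestamp (the WEBVTT header) are skipped
            $line_new .= " $line";
        }
    }

    // Print the last cue
    if($line_new !== "") echo $line_new."\n";
    fclose($f);
}


send_reformatted("test.vtt");

?>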
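
One caveat about the pattern above: WebVTT timestamps may also carry an hours part (hh:mm:ss.ttt), and the timestamp line may be followed by cue settings such as align:start, neither of which the regex matches. A rough sketch of a looser matcher, in case your files contain those; the function name vtt_start_time is made up for illustration and this has only been checked against the kinds of lines shown here.

```php
<?php
// Hypothetical helper (name is illustrative): returns the start time of a
// WebVTT timestamp line, or null if the line is not a timestamp line.
function vtt_start_time(string $line): ?string
{
    // One timestamp: optional "hh:" prefix, then "mm:ss.ttt"
    $ts = '(?:\d{2,}:)?\d{2}:\d{2}\.\d{3}';
    // Start and end timestamps around "-->", optionally followed by cue settings
    $pattern = '/^(' . $ts . ')\s+-->\s+' . $ts . '(?:\s+.*)?$/';

    // trim() removes the trailing newline (and "\r") left by fgets()
    return preg_match($pattern, trim($line), $match) ? $match[1] : null;
}
```

Inside the loop, `vtt_start_time($line)` could replace the preg_match() call: a non-null result means a new cue starts on that line.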

Upvotes: 0
