Reputation: 23
I have a fairly large (~65MB) XML file with nearly 1 million lines and I am using Laravel to parse and process the contents and then insert the new data into a MySQL database.
It is a music library that I update regularly and the software I use generates this XML file.
The code itself works fine, but it takes a very long time: over 30 minutes to process around 50,000 records! I'm looking for a way to speed this up. I am using Laravel 6 on an Ubuntu server running Apache, if that helps.
I basically read the XML file, extract what I need, sanitize the data a little, and then insert it into my database. Here is the relevant part of my code. Can anyone suggest a better way to make this more efficient? I'm not a Laravel expert, so any feedback would be great.
use App\Music;
Music::truncate(); //clear existing data
\DB::disableQueryLog(); //stop Laravel from keeping every query in memory during the import
ini_set('memory_limit', '512M'); //boost memory limit
ini_set('max_execution_time', '90'); //try to prevent time-out
//list of files to import (I sometimes have more than 1):
$files = [
"path/to/my/database.xml",
"path/to/my/database2.xml"
];
$video_files = ["mp4","mov","avi","flv"]; //used to identify music videos
foreach($files as $file){
    $reader = new XMLReader();
    if(!$reader->open($file)){
        die("Failed to open xml file!");
    }
    $doc = new DOMDocument;
    while ($reader->read() && $reader->name !== 'Song');
    while ($reader->name === 'Song'){
        $song = simplexml_import_dom($doc->importNode($reader->expand(), true));
        if(strpos($song['FilePath'], 'netsearch://') === false && strpos($song['FilePath'], ':/DJ Tools/') === false){
            $music = new Music; //create new instance of model
            foreach ($song->Tags as $tag){
                if(($tag['Author'] != "" || $tag['Title'] != "") && ($tag['Grouping'] != "Studio")){
                    $insert = true; //insert record or not
                    foreach($song->Infos as $info){
                        $music->length = gmdate("H:i:s",floatval($info['SongLength']));
                        $music->file_date = date("Y-m-d",intval($info['FirstSeen']));
                    }
                    if($insert == true){
                        $music->bpm = ($tag['Bpm'] > 0) ? round(1 / floatval($tag['Bpm']) * 60) : null; //to calculate use 1/bpm * 60 and round
                        $music->file_path = $song['FilePath'];
                        $music->artist = trim($tag['Author']);
                        $music->title = trim($tag['Title']);
                        $music->remix = trim($tag['Remix']);
                        $music->album = trim($tag['Album']);
                        $music->genre = trim($tag['Genre']);
                        $music->filetype = substr($song['FilePath'],-3);
                        $music->year = ($tag['Year'] > 0) ? intval($tag['Year']) : null;
                        //set the kind (audio, video or karaoke):
                        if(strpos($song['FilePath'], '/Karaoke/') !== false){
                            $kind = "karaoke";
                        }
                        elseif(in_array(strtolower(substr($song['FilePath'],-3)),$video_files)){
                            $kind = "video";
                        }
                        else{
                            $kind = "audio";
                        }
                        $music->kind = $kind;
                        $music->save(); //adds song to mysql
                    } //end if insert true
                } //end has title or author + non-studio
            } //end for each tag
        } //end not a netsearch file
        $reader->next('Song');
    } //end while
    $reader->close();
} //end for each files
The structure of the XML file looks something like this:
<Song FilePath="D:/Path/To/Music/Michael Jackson/The Ultimate Collection/2-03 Thriller.mp3" FileSize="12974048">
    <Tags Author="Michael Jackson" Title="Thriller" Genre="Pop" Album="The Ultimate Collection" Composer="Rod Temperton" TrackNumber="3/11" Grouping="Halloween" Year="2004" Bpm="0.504202" Key="G#m" Flag="1" />
    <Infos SongLength="356.960363" FirstSeen="1501430558" Bitrate="282" Cover="1" />
    <Comment>Great for parties</Comment>
    <Scan Version="801" Bpm="0.506077" AltBpm="0.379569" Volume="1.101067" Key="G#m" Flag="32768" />
    <Poi Pos="17.171541" Type="beatgrid" />
    <Poi Pos="0.634195" Type="automix" Point="realStart" />
    <Poi Pos="356.051882" Type="automix" Point="realEnd" />
    <Poi Pos="17.30" Type="automix" Point="fadeStart" />
    <Poi Pos="352.750" Type="automix" Point="fadeEnd" />
    <Poi Pos="41.695057" Type="automix" Point="cutStart" />
    <Poi Pos="343.074830" Type="automix" Point="cutEnd" />
    <Poi Pos="44.289569" Type="automix" Point="tempoStart" />
    <Poi Pos="298.550091" Type="automix" Point="tempoEnd" />
</Song>
<Song FilePath="D:/Path/To/Music/Black Sabbath/We Sold Our Soul for Rock 'n' Roll/09 Sweet Leaf.m4a" FileSize="10799807">
    <Tags Author="Black Sabbath" Title="Sweet Leaf" Genre="Heavy Metal" Album="We Sold Our Soul For Rock 'n' Roll" Composer="Geezer Butler" TrackNumber="9/14" Year="1987" Key="Am" Flag="1" />
    <Infos SongLength="306.456961" FirstSeen="1501430556" Bitrate="259" Cover="1" />
    <Scan Version="801" Bpm="0.411757" AltBpm="0.617438" Volume="0.680230" Key="Am" Flag="32768" />
    <Poi Pos="1.753537" Type="beatgrid" />
    <Poi Pos="0.220590" Type="automix" Point="realStart" />
    <Poi Pos="301.146848" Type="automix" Point="realEnd" />
    <Poi Pos="0.30" Type="automix" Point="fadeStart" />
    <Poi Pos="291.50" Type="automix" Point="fadeEnd" />
</Song>
...tens of thousands more songs, nearly 1 million lines in total
Upvotes: 0
Views: 1954
Reputation: 416
If you create a model instance and run an insert query on every iteration, you end up with one Music instance and one INSERT query per record, which is not efficient. Instead, collect the data into an array first, chunk it, and then save it to the database in batches.
For example, if you have 1,000 songs and save a model inside the loop, that is 1,000 Music instances and 1,000 INSERT queries. If you collect the data into an array first and chunk it into 20 arrays of 50 songs each, only 20 INSERT queries are executed, which is far more efficient.
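As a rough sketch of the idea (assuming $rows already holds one array of column => value pairs per song):
//Sketch only: $rows is assumed to already contain one ['column' => value, ...] array per song
foreach (array_chunk($rows, 50) as $chunk) {
    Music::insert($chunk); //one INSERT for up to 50 rows instead of 50 single-row INSERTs
}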
Your full code would then look something like this:
<?php
use App\Music;
Music::truncate(); //clear existing data
\DB::disableQueryLog(); //stop Laravel from keeping every query in memory during the import
ini_set('memory_limit', '512M'); //boost memory limit
ini_set('max_execution_time', '90'); //try to prevent time-out
//list of files to import (I sometimes have more than 1):
$files = [
"path/to/my/database.xml",
"path/to/my/database2.xml"
];
$video_files = ["mp4","mov","avi","flv"]; //used to identify music videos
//declare array of music here
$arrayOfMusic = [];
foreach($files as $file){
    $reader = new XMLReader();
    if(!$reader->open($file)){
        die("Failed to open xml file!");
    }
    $doc = new DOMDocument;
    while ($reader->read() && $reader->name !== 'Song');
    while ($reader->name === 'Song') {
        $song = simplexml_import_dom($doc->importNode($reader->expand(), true));
        if(strpos($song['FilePath'], 'netsearch://') === false && strpos($song['FilePath'], ':/DJ Tools/') === false) {
            foreach ($song->Tags as $tag) {
                if (($tag['Author'] != "" || $tag['Title'] != "") && ($tag['Grouping'] != "Studio")) {
                    $insert = true; //insert record or not
                    $length = null; //reset per song so the ?? fallbacks below work as intended
                    $file_date = null;
                    foreach ($song->Infos as $info) {
                        $length = gmdate("H:i:s",floatval($info['SongLength']));
                        $file_date = date("Y-m-d",intval($info['FirstSeen']));
                    }
                    if($insert == true){
                        //set the kind (audio, video or karaoke):
                        if(strpos($song['FilePath'], '/Karaoke/') !== false){
                            $kind = "karaoke";
                        } elseif (in_array(strtolower(substr($song['FilePath'],-3)),$video_files)) {
                            $kind = "video";
                        } else {
                            $kind = "audio";
                        }
                        //Fill array of music
                        $arrayOfMusic[] = [
                            'bpm' => ($tag['Bpm'] > 0) ? round(1 / floatval($tag['Bpm']) * 60) : null, //to calculate use 1/bpm * 60 and round
                            'file_path' => $song['FilePath'],
                            'artist' => trim($tag['Author']),
                            'length' => $length ?? '0', //set $length to 0 if it cannot be found
                            'file_date' => $file_date ?? '0', //set $file_date to 0 if it cannot be found
                            'title' => trim($tag['Title']),
                            'remix' => trim($tag['Remix']),
                            'album' => trim($tag['Album']),
                            'genre' => trim($tag['Genre']),
                            'filetype' => substr($song['FilePath'],-3),
                            'year' => ($tag['Year'] > 0) ? intval($tag['Year']) : null,
                            'kind' => $kind,
                        ];
                    } //end if insert true
                } //end has title or author + non-studio
            } //end for each tag
        } //end not a netsearch file
        $reader->next('Song');
    } //end while
    $reader->close();
} //end for each files

//Chunk the array if $arrayOfMusic is not empty
if (!empty($arrayOfMusic)) {
    $arrayOfMusicChunked = array_chunk($arrayOfMusic, 30); //chunk the large array; here each chunk contains 30 items
    //loop over the chunks and insert each one with insert()
    foreach ($arrayOfMusicChunked as $arrayOfMusicToSave) {
        Music::insert($arrayOfMusicToSave);
    }
}
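One caveat worth noting (a general Laravel behaviour, and only relevant if your music table actually has timestamp columns, which is an assumption here): Music::insert() goes straight to the query builder, so it bypasses Eloquent mutators, model events, and the automatic created_at/updated_at timestamps that save() would set. If you need them, add them to each row yourself, for example:
$now = now(); //current Carbon timestamp
foreach ($arrayOfMusicChunked as $arrayOfMusicToSave) {
    foreach ($arrayOfMusicToSave as &$row) {
        $row['created_at'] = $now; //assumes the music table has these columns
        $row['updated_at'] = $now;
    }
    unset($row); //break the reference left by the inner loop
    Music::insert($arrayOfMusicToSave);
}
Also, the chunk size of 30 is just an example; larger chunks mean fewer queries, as long as each INSERT stays within MySQL's max_allowed_packet limit.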
Upvotes: 3