Reputation: 31205
I'm trying to pull data out of an HTML file into an array using PHP regex. Below are two rows of the data file. I want to extract the part number (9517170 is one example), make, model, and the download URL. Here is my failed regex attempt to extract the part number and URL:
/Row[0|1] ([0-9]+)"(.*?)(\/component[0-9a-zA-Z_:-\/]+)/
Any regex gurus out there that can get me pointed in the right direction?
Thanks!
<tr id="table_6_row_127" class="fabrik_row oddRow1 9517170">
<td class="fabrik_row___jos_baseplates___DemcoPart" ><a class='fabrik___rowlink' href='/baseplates/fitlist/details/6/6/127.html'>9517170</a></td>
<td class="fabrik_row___jos_baseplates___Make" >Subaru</td>
<td class="fabrik_row___jos_baseplates___Model" >Legacy Outback *4</td>
<td class="fabrik_row___jos_baseplates___Years" >03-04</td>
<td class="fabrik_row___jos_baseplates___A" >3</td>
<td class="fabrik_row___jos_baseplates___B" >25</td>
<td class="fabrik_row___jos_baseplates___C" >23</td>
<td class="fabrik_row___jos_baseplates___D" >15 1/2</td>
<td class="fabrik_row___jos_baseplates___Price" >370</td>
<td class="fabrik_row___jos_baseplates___Download" ><a href='/component/docman/doc_download/250-tp20170.html' target='_self'>TP20170</a></td>
</tr>
<tr id="table_6_row_431" class="fabrik_row oddRow0 9518272">
<td class="fabrik_row___jos_baseplates___DemcoPart" ><a class='fabrik___rowlink' href='/baseplates/fitlist/details/6/6/431.html'>9518272</a></td>
<td class="fabrik_row___jos_baseplates___Make" >Subaru</td>
<td class="fabrik_row___jos_baseplates___Model" >Outback *4*9</td>
<td class="fabrik_row___jos_baseplates___Years" >10-11</td>
<td class="fabrik_row___jos_baseplates___A" >3</td>
<td class="fabrik_row___jos_baseplates___B" >30</td>
<td class="fabrik_row___jos_baseplates___C" >25-1/8"</td>
<td class="fabrik_row___jos_baseplates___D" >17-1/4"</td>
<td class="fabrik_row___jos_baseplates___Price" >370</td>
<td class="fabrik_row___jos_baseplates___Download" ><a href='http://demco-products.com/component/docman/doc_download/921-tp20272.html' target='_self'>tp20272</a></td>
</tr>
Upvotes: 0
Views: 216
Reputation: 53496
Use DOMDocument::loadHTML instead. It uses libxml under the hood, which is fast and robust.
Don't try to parse HTML with regexes.
I'm stressing that because I see it a lot on here, and the solutions are fragile at best and buggy at worst. Once you have used a true HTML parser to get the elements and attributes you want, running a regex on those extracted values is more reasonable.
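As a rough sketch of the parser approach, something like the following might work for rows shaped like the two samples in the question. The function name extractRows and the array keys are made up for illustration, and the XPath expressions assume every row is a tr with class fabrik_row whose cells carry class names ending in ___DemcoPart, ___Make, ___Model and ___Download; adjust them if the real file differs.

```php
<?php
// Hypothetical helper: returns one associative array per fabrik row.
// Assumes the markup matches the sample rows shown in the question.
function extractRows(string $html): array
{
    $doc = new DOMDocument();

    // Collect libxml warnings internally instead of emitting PHP warnings
    // for imperfect real-world HTML, then restore the previous setting.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $rows = [];

    foreach ($xpath->query('//tr[contains(@class, "fabrik_row")]') as $tr) {
        // Trimmed text of the first cell in this row whose class contains the column suffix.
        $cell = function (string $column) use ($xpath, $tr): string {
            $query = sprintf('string(td[contains(@class, "___%s")])', $column);
            return trim($xpath->evaluate($query, $tr));
        };

        $rows[] = [
            'part'  => $cell('DemcoPart'),
            'make'  => $cell('Make'),
            'model' => $cell('Model'),
            // href of the link inside the Download cell (empty string if absent).
            'url'   => $xpath->evaluate(
                'string(td[contains(@class, "___Download")]/a/@href)',
                $tr
            ),
        ];
    }

    return $rows;
}
```

You would call it with the file contents, e.g. extractRows(file_get_contents($path)) where $path points at your HTML file. Note that the download links in the sample are a mix of relative and absolute URLs, so you may still want to normalize them afterwards.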
Upvotes: 2