Reputation: 31205

Need help extracting data from HTML file using regex

I'm trying to pull data out of an HTML file into an array using PHP regex. Below are two rows from the data file. I want to extract the part number (9517170 is one example), make, model, and download URL. Here is my failed regex attempt to extract the part number and URL:

/Row[0|1] ([0-9]+)"(.*?)(\/component[0-9a-zA-Z_:-\/]+)/

Any regex gurus out there that can get me pointed in the right direction?

Thanks!

    <tr id="table_6_row_127" class="fabrik_row oddRow1 9517170">
            <td class="fabrik_row___jos_baseplates___DemcoPart" ><a class='fabrik___rowlink' href='/baseplates/fitlist/details/6/6/127.html'>9517170</a></td>
            <td class="fabrik_row___jos_baseplates___Make" >Subaru</td>
            <td class="fabrik_row___jos_baseplates___Model" >Legacy Outback *4</td>
            <td class="fabrik_row___jos_baseplates___Years" >03-04</td>
            <td class="fabrik_row___jos_baseplates___A" >3</td>
            <td class="fabrik_row___jos_baseplates___B" >25</td>
            <td class="fabrik_row___jos_baseplates___C" >23</td>
            <td class="fabrik_row___jos_baseplates___D" >15 1/2</td>
            <td class="fabrik_row___jos_baseplates___Price" >370</td>
            <td class="fabrik_row___jos_baseplates___Download" ><a href='/component/docman/doc_download/250-tp20170.html' target='_self'>TP20170</a></td>
    </tr>
    <tr id="table_6_row_431" class="fabrik_row oddRow0 9518272">
            <td class="fabrik_row___jos_baseplates___DemcoPart" ><a class='fabrik___rowlink' href='/baseplates/fitlist/details/6/6/431.html'>9518272</a></td>
            <td class="fabrik_row___jos_baseplates___Make" >Subaru</td>
            <td class="fabrik_row___jos_baseplates___Model" >Outback *4*9</td>
            <td class="fabrik_row___jos_baseplates___Years" >10-11</td>
            <td class="fabrik_row___jos_baseplates___A" >3</td>
            <td class="fabrik_row___jos_baseplates___B" >30</td>
            <td class="fabrik_row___jos_baseplates___C" >25-1/8"</td>
            <td class="fabrik_row___jos_baseplates___D" >17-1/4"</td>
            <td class="fabrik_row___jos_baseplates___Price" >370</td>
            <td class="fabrik_row___jos_baseplates___Download" ><a href='http://demco-products.com/component/docman/doc_download/921-tp20272.html' target='_self'>tp20272</a></td>
    </tr>

Upvotes: 0

Answers (1)

Andrew White

Reputation: 53496

Use DOMDocument::loadHTML. It uses libxml under the hood, which is fast and robust.

**Don't try to parse HTML with regexes.**

I made that bold because I see it a lot on here, and the solutions are always fragile at best and buggy at worst. Once you have used a true HTML parser to get the attributes you want, using a regex on those values is more reasonable.
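As a rough illustration of the DOMDocument approach, here is a minimal sketch that pulls the four fields out of rows shaped like the ones in the question. The function name `extractRows` and the output array keys are made up for this example, and it assumes each cell's `class` attribute is exactly `fabrik_row___jos_baseplates___` plus the column name, as in the sample rows. Treat it as a starting point rather than a finished solution.

```php
<?php
// Hypothetical helper (not part of any library): returns one array per
// <tr class="fabrik_row ..."> with part number, make, model and download URL.
function extractRows(string $html): array
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid, so buffer libxml warnings
    // instead of letting them be emitted, then restore the old setting.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath  = new DOMXPath($doc);
    // Assumption: every cell class is this prefix plus the column name.
    $prefix = 'fabrik_row___jos_baseplates___';

    // Trimmed text content of the named cell inside one row.
    $cell = function (DOMNode $row, string $column) use ($xpath, $prefix): string {
        return trim($xpath->evaluate("string(td[@class='{$prefix}{$column}'])", $row));
    };

    $rows = [];
    foreach ($xpath->query("//tr[contains(@class, 'fabrik_row')]") as $tr) {
        $rows[] = [
            'part'  => $cell($tr, 'DemcoPart'),
            'make'  => $cell($tr, 'Make'),
            'model' => $cell($tr, 'Model'),
            // The download URL is the href of the link in the Download cell.
            'url'   => $xpath->evaluate("string(td[@class='{$prefix}Download']/a/@href)", $tr),
        ];
    }
    return $rows;
}
```

Calling `extractRows(file_get_contents('yourfile.html'))` (the file name is a placeholder) should give you one array per table row. Note that one sample row uses a relative download URL and the other an absolute one, so you may still want to normalize those afterwards.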

Upvotes: 2
