Reputation: 1000
I have to extract data from many sites structured like this http://www.firmenmonitor.at/Secure/CompanyDetail.aspx?CID=408053&SID=4af735f7-4eb7-4f8e-a1df-948f6fb66a18&PID=1
I am interested in the second 'textModule' div. There are three sections:
In der Rolle Aufsichtsrat waren oder sind gemeldet:
(...)
In der Rolle Geschäftsführer waren oder sind gemeldet:
(...)
In der Rolle Gesellschafter waren oder sind gemeldet:
(...)
I know how to extract names and other info, but I would also like to know which section each member belongs to. For example:
Köhlmeier Harald - Aufsichtsrat
Mazzel Josef - Aufsichtsrat
(...)
Konstatzky Adolf F. - Geschäftsführer
My issue is that this div has a very flat structure and the header for each section is just an <h3>, so I don't know how to figure out where one section ends and the next one starts. I can't really show what I've tried so far, as I have no idea how to approach it. Any hints?
Upvotes: 0
Views: 47
Reputation: 8608
If I understand your question correctly, you're just looking for a way to split the three sections so you can process each one independently and extract data knowing which section it belongs to.
In that case you can leverage the fact that the exact string <h3 is what separates the sections. Extract the second div, save it as a string named e.g. second_div, and call second_div.split("<h3") to get a list whose items 1, 2 and 3 (not 0) each contain the HTML of one section.
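For illustration, here is a minimal sketch of that split approach. The HTML in second_div is a made-up stand-in for the real page, and both the assumption that each name sits in its own <p> tag and the helper name members_by_section are hypothetical, so the inner pattern will likely need adjusting to the actual markup.

```python
import re

# Hypothetical stand-in for the HTML of the second 'textModule' div;
# the real page's markup may differ (e.g. names may not be in <p> tags).
second_div = (
    '<div class="textModule">'
    "<h3>In der Rolle Aufsichtsrat waren oder sind gemeldet:</h3>"
    "<p>Köhlmeier Harald</p><p>Mazzel Josef</p>"
    "<h3>In der Rolle Geschäftsführer waren oder sind gemeldet:</h3>"
    "<p>Konstatzky Adolf F.</p>"
    "</div>"
)


def members_by_section(html):
    """Return (name, role) pairs, using '<h3' as the section separator."""
    pairs = []
    # Item 0 is whatever precedes the first <h3, so skip it.
    for chunk in html.split("<h3")[1:]:
        chunk = "<h3" + chunk  # put back the delimiter that split() removed
        heading = re.search(r"<h3[^>]*>(.*?)</h3>", chunk, re.S)
        if heading is None:
            continue
        # Pull the role out of "In der Rolle X waren oder sind gemeldet:";
        # fall back to the full heading text if the wording differs.
        role_match = re.search(r"In der Rolle (.+?) waren", heading.group(1))
        role = role_match.group(1) if role_match else heading.group(1).strip()
        # Assumed layout: one <p> element per member name.
        for name in re.findall(r"<p[^>]*>(.*?)</p>", chunk, re.S):
            pairs.append((name.strip(), role))
    return pairs


pairs = members_by_section(second_div)
for name, role in pairs:
    print(f"{name} - {role}")
```

Because every chunk carries its own heading, the role is known at the moment each name is extracted, which is what the flat structure otherwise makes hard to recover.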
Upvotes: 1