pawelty

Reputation: 1000

extracting data between tags

I have to extract data from many pages structured like this one: http://www.firmenmonitor.at/Secure/CompanyDetail.aspx?CID=408053&SID=4af735f7-4eb7-4f8e-a1df-948f6fb66a18&PID=1

I am interested in the second 'textModule' div. It contains three sections:

In der Rolle Aufsichtsrat waren oder sind gemeldet:
(...)
In der Rolle Geschäftsführer waren oder sind gemeldet:
(...)
In der Rolle Gesellschafter waren oder sind gemeldet:
(...)

I know how to extract names and other info, but I would also like to know which section each person belongs to. For example:

Köhlmeier Harald - Aufsichtsrat
Mazzel Josef - Aufsichtsrat
(...)
Konstatzky Adolf F. - Geschäftsführer

My issue is that this div has a very flat structure and the header for each section is just an <h3>, so I don't know how to figure out where one section ends and the next one starts. I can't really show you what I've tried so far, as I have no idea how to approach it. Any hints?

Upvotes: 0

Views: 47

Answers (1)

Ulf Aslak

Reputation: 8608

If I understand your question correctly, you're just looking for a way to split the three sections, so you can process each one independently and extract data knowing which section it belongs to.

In that case you can leverage the fact that the exact string <h3 is what separates the sections. Extract the second div, save it as a string named e.g. second_div, and call second_div.split("<h3") to get a list in which items 1, 2 and 3 (not 0) contain the HTML of the separate sections. Item 0 is just whatever comes before the first <h3>.
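To make the split idea concrete, here is a minimal sketch in Python. I haven't checked the real page's markup, so the sample second_div string below is a made-up stand-in (the <a> tags around the names are an assumption), and members_by_section is just a hypothetical helper name. It only shows how the <h3 boundary lets you carry the role along with each name.

```python
import re

# Hypothetical stand-in for the second 'textModule' div. The real page's
# markup (e.g. names wrapped in <a> tags) is an assumption for illustration.
second_div = (
    '<div class="textModule">'
    '<h3>In der Rolle Aufsichtsrat waren oder sind gemeldet:</h3>'
    '<a href="#">Köhlmeier Harald</a><br/>'
    '<a href="#">Mazzel Josef</a><br/>'
    '<h3>In der Rolle Geschäftsführer waren oder sind gemeldet:</h3>'
    '<a href="#">Konstatzky Adolf F.</a><br/>'
    '</div>'
)


def members_by_section(div_html):
    """Return (name, role) pairs, using each <h3 as a section boundary."""
    pairs = []
    # Item 0 is whatever precedes the first <h3, so skip it.
    for chunk in div_html.split("<h3")[1:]:
        # Each chunk looks like '>heading text</h3>...section body...'
        heading, _, body = chunk.partition("</h3>")
        heading_text = heading.partition(">")[2]
        # Pull the role out of 'In der Rolle X waren oder sind gemeldet:'
        role_match = re.search(r"In der Rolle (.+?) waren", heading_text)
        role = role_match.group(1) if role_match else heading_text.strip()
        # Assumed: each name is the text of an <a> element in the body.
        for name in re.findall(r"<a[^>]*>([^<]+)</a>", body):
            pairs.append((name.strip(), role))
    return pairs


for name, role in members_by_section(second_div):
    print(f"{name} - {role}")
```

Splitting on a raw string is fragile if the markup changes (for instance an uppercase <H3), so treat this as a quick hack. If you already parse the page with an HTML parser, walking the div's children and remembering the last <h3> you saw should achieve the same thing more robustly.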

Upvotes: 1
