vashts85

Reputation: 1147

BeautifulSoup: extracting attribute for various items

Let's say we have HTML like this (sorry, I don't know how to copy and paste page info and this is on an intranet):

[Screenshot of the page's HTML; image not available]

I want to get the highlighted portion for every question (the page is laid out like a Stack Overflow question list). EDIT: to be clearer, what I am interested in is a list like this:

['question-summary-39968',
 'question-summary-40219',
 'question-summary-42899',
 'question-summary-34348',
 'question-summary-32497',
 'question-summary-35308',
...]

Now I know that a working solution is a list comprehension where I could do:

[item["id"] for item in html_df.find_all(class_="question-summary")]

But this is not exactly what I want. How can I directly access question-summary-41823 for the first item?

Also, what is the difference between soup.select and soup.get?

Upvotes: 0

Views: 59

Answers (1)

vashts85

Reputation: 1147

I thought I would post my answer here if it helps others.

What I am trying to do is access the id attribute of each element that has the question-summary class.

You can do something like this to get it for only the first matching element:

html_df.find(class_="question-summary")["id"]
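
As a minimal sketch of that call, assuming the markup looks roughly like the made-up sample below (the ids and the html.parser choice are illustrative, not taken from the real page):

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the intranet page; the ids are invented.
SAMPLE_HTML = """
<div id="questions">
  <div class="question-summary" id="question-summary-101">First</div>
  <div class="question-summary" id="question-summary-102">Second</div>
</div>
"""

html_df = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find returns the first matching Tag, or None when nothing matches,
# so guard before subscripting.
first = html_df.find(class_="question-summary")
first_id = first["id"] if first is not None else None
print(first_id)  # question-summary-101
```

The None check matters because subscripting None would raise a TypeError on a page with no matching elements.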

But you want it for all of them. So you could do this to get every element with that class:

html_df.select('.question-summary')

But you can't just do

html_df.select('.question-summary')["id"]

Because select returns a list of bs4.element.Tag objects, and a list can't be indexed with a string. So you need to iterate over the list and pull out just the piece that you want. You could write a for loop, but a more elegant way is to use a list comprehension:

[item["id"] for item in html_df.find_all(class_="question-summary")]

Breaking down what this does:

  • Creates a list of all the question-summary elements in the soup
  • Iterates over each element in the list, which we've named item
  • Extracts the id attribute and adds it to the new list
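
The same steps can be written out as a plain loop. This is only a sketch on made-up markup (the ids are invented); it also shows why indexing the whole result with "id" fails, and it uses Tag.get so that an element without an id is skipped, which is a small departure from the comprehension (item["id"] would raise KeyError there):

```python
from bs4 import BeautifulSoup

# Made-up markup standing in for the real page; the ids are illustrative only.
SAMPLE_HTML = """
<div class="question-summary" id="question-summary-101">First</div>
<div class="question-summary" id="question-summary-102">Second</div>
<div class="question-summary">No id on this one</div>
"""

html_df = BeautifulSoup(SAMPLE_HTML, "html.parser")
matches = html_df.find_all(class_="question-summary")

# The result behaves like a list, so a string index is rejected.
try:
    matches["id"]
    index_failed = False
except TypeError:
    index_failed = True

# The comprehension, unrolled. Tag.get returns None for a missing
# attribute instead of raising KeyError.
ids = []
for item in matches:
    item_id = item.get("id")
    if item_id is not None:
        ids.append(item_id)

print(ids)  # ['question-summary-101', 'question-summary-102']
```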

Alternatively you can use select:

[item["id"] for item in html_df.select(".question-summary")]

I prefer the first version because it's more explicit, but either one results in:

['question-summary-43960',
 'question-summary-43953',
 'question-summary-43959',
 'question-summary-43947',
 'question-summary-43952',
 'question-summary-43945',
...]
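
A quick way to convince yourself that the find_all form and the CSS-selector form pick out the same elements is to run both on a small sample. The markup and ids below are invented for illustration:

```python
from bs4 import BeautifulSoup

# Invented sample; the real page's ids will differ.
SAMPLE_HTML = """
<div class="question-summary" id="question-summary-101">First</div>
<div class="question-summary narrow" id="question-summary-102">Second</div>
<div class="other" id="not-a-question">Ignored</div>
"""

html_df = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Keyword filter on the class attribute.
ids_find_all = [item["id"] for item in html_df.find_all(class_="question-summary")]

# CSS selector for the same class.
ids_select = [item["id"] for item in html_df.select(".question-summary")]

print(ids_find_all == ids_select)  # True
print(ids_select)  # ['question-summary-101', 'question-summary-102']
```

Both forms match an element that carries additional classes, such as the second div here.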

Upvotes: 1
