Christian Read

Reputation: 143

How to get specific item having same class name and attributes

How can I get specific items when they all share the same class name and attributes?

I need to get these 3 items:

April 14, 2013

580

Fort Pierce, FL

<dl class="pairsJustified">
<dt>Joined:</dt>
<dd>Apr 14, 2013</dd>
</dl>
<dl class="pairsJustified">
<dt>Messages:</dt>
<dd><a href="search/member?user_id=13302" class="concealed" 
rel="nofollow">580</a></dd>
</dl>

<dl class="pairsJustified">
<dt>Location:</dt>
<dd>
<a href="misc/location-info?location=Fort+Pierce%2C+FL" target="_blank" 
rel="nofollow noreferrer" itemprop="address" class="concealed">Fort 
Pierce, FL</a>

Upvotes: 0

Views: 44

Answers (2)

Verbal_Kint

Reputation: 1416

This is a good starting point:

In [18]: for a in response.css('.extraUserInfo'):
    ...:     print(a.css('*::text').extract())
    ...:     print('\n\n\n')
    ...:     
['\n', '\n', '\n', '\n']  # <--this (and other outputs like this) is because there is an extra `extraUserInfo` class block above the desired info block if the user has a user group picture/avatar below their username




['\n', '\n', 'Joined:', '\n', 'Mar 24, 2013', '\n', '\n', '\n', 'Messages:', '\n', '6,747', '\n', '\n']




['\n', '\n', '\n', '\n']




['\n', '\n', 'Joined:', '\n', 'Mar 24, 2013', '\n', '\n', '\n', 'Messages:', '\n', '6,747', '\n', '\n']




['\n', '\n', 'Joined:', '\n', 'Apr 14, 2013', '\n', '\n', '\n', 'Messages:', '\n', '580', '\n', '\n', '\n', 'Location:', '\n', '\n', 'Fort Pierce, FL', '\n', '\n', '\n']




['\n', '\n', 'Joined:', '\n', 'Oct 20, 2012', '\n', '\n', '\n', 'Messages:', '\n', '2,476', '\n', '\n', '\n', 'Location:', '\n', '\n', 'Philadelphia, PA', '\n', '\n', '\n']




['\n', '\n', 'Joined:', '\n', 'Dec 11, 2012', '\n', '\n', '\n', 'Messages:', '\n', '2,938', '\n', '\n', '\n', 'Location:', '\n', '\n', 'Colorado', '\n', '\n', '\n']




['\n', '\n', 'Joined:', '\n', 'Sep 30, 2016', '\n', '\n', '\n', 'Messages:', '\n', '833', '\n', '\n', '\n', 'Location:', '\n', '\n', 'Indiana', '\n', '\n', '\n']


...

There are many ways to approach this, and a little fiddling around will get the data formatted to your liking. The approach above is only a starting point, because many of the outputs are lists containing nothing but newline characters. That seems to happen because, when a user has a user-group image (like tesla of arizona), the extraUserInfo class is also used to group that block of HTML. There will be better ways to group this...

Basically, response.css('.extraUserInfo') selects every block with the class extraUserInfo, which seems to be the blocks holding the user info you're looking for. From there, extract all the underlying text with the ::text pseudo-element and parse the resulting lists.
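As a rough sketch of the "parse the lists" step: the helper below (parse_user_info is a made-up name, not part of Scrapy) strips the whitespace-only entries and pairs each label with the entry that follows it. It assumes labels and values strictly alternate, as they do in the output above. The sample lists are copied from that output.

```python
def parse_user_info(texts):
    """Turn a list of raw ::text strings into a {label: value} dict."""
    # Drop whitespace-only entries such as '\n'.
    cleaned = [t.strip() for t in texts if t.strip()]
    info = {}
    # Assumption: entries alternate label, value, label, value, ...
    for label, value in zip(cleaned[::2], cleaned[1::2]):
        info[label.rstrip(':')] = value
    return info


# Two of the lists printed above: an avatar-only block and a real info block.
blocks = [
    ['\n', '\n', '\n', '\n'],
    ['\n', '\n', 'Joined:', '\n', 'Apr 14, 2013', '\n', '\n', '\n',
     'Messages:', '\n', '580', '\n', '\n', '\n',
     'Location:', '\n', '\n', 'Fort Pierce, FL', '\n', '\n', '\n'],
]

# Avatar-only blocks parse to an empty dict, so filter them out.
users = [info for info in map(parse_user_info, blocks) if info]
print(users)
```

In a spider you would feed it a.css('*::text').extract() for each block instead of the hard-coded lists.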

There is definitely a better way to approach this: if you look carefully at the HTML structure, you can extract the data in a way that leaves you less processing work afterwards. Still, this should get you on the right track. The CSS selector and XPath documentation should be a great help.

Upvotes: 0

DirtyBit

Reputation: 16772

Since they all lie under the <dd> tag, use .find_all():

from bs4 import BeautifulSoup

test = '''<dl class="pairsJustified">
<dt>Joined:</dt>
<dd>Apr 14, 2013</dd>
</dl>
<dl class="pairsJustified">
<dt>Messages:</dt>
<dd><a href="search/member?user_id=13302" class="concealed" 
rel="nofollow">580</a></dd>
</dl>

<dl class="pairsJustified">
<dt>Location:</dt>
<dd>
<a href="misc/location-info?location=Fort+Pierce%2C+FL" target="_blank" 
rel="nofollow noreferrer" itemprop="address" class="concealed">Fort Pierce, FL</a>'''

soup = BeautifulSoup(test, 'html.parser')
data = soup.find_all("dd")
for d in data:
    print(d.text.strip())

OUTPUT:

Apr 14, 2013
580
Fort Pierce, FL
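If you also want to know which value is which, a small variation is to pair each <dd> with the <dt> in the same <dl>. This is just a sketch: extract_pairs is a hypothetical helper name, and the sample is the same markup with the closing tags filled in.

```python
from bs4 import BeautifulSoup

# Assumed sample: the question's markup with the missing closing tags added.
html = '''<dl class="pairsJustified">
<dt>Joined:</dt>
<dd>Apr 14, 2013</dd>
</dl>
<dl class="pairsJustified">
<dt>Messages:</dt>
<dd><a href="search/member?user_id=13302" class="concealed" rel="nofollow">580</a></dd>
</dl>
<dl class="pairsJustified">
<dt>Location:</dt>
<dd>
<a href="misc/location-info?location=Fort+Pierce%2C+FL" class="concealed">Fort Pierce, FL</a>
</dd>
</dl>'''


def extract_pairs(markup):
    """Return {label: value} for every <dl class="pairsJustified">."""
    soup = BeautifulSoup(markup, 'html.parser')
    info = {}
    for dl in soup.find_all('dl', class_='pairsJustified'):
        dt, dd = dl.find('dt'), dl.find('dd')
        if dt is None or dd is None:
            continue  # skip blocks that are missing a label or a value
        label = dt.get_text(strip=True).rstrip(':')
        # split()/join collapses newlines and repeated spaces in the value.
        info[label] = ' '.join(dd.get_text().split())
    return info


print(extract_pairs(html))
```

That way info['Location'] still works for users who have no location listed, instead of relying on the position in the list.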

Upvotes: 1
