Reputation: 143
How can I get specific items when several elements share the same class name and attributes?
I need to get these 3 items:
April 14, 2013
580
Fort Pierce, FL
<dl class="pairsJustified">
<dt>Joined:</dt>
<dd>Apr 14, 2013</dd>
</dl>
<dl class="pairsJustified">
<dt>Messages:</dt>
<dd><a href="search/member?user_id=13302" class="concealed"
rel="nofollow">580</a></dd>
</dl>
<dl class="pairsJustified">
<dt>Location:</dt>
<dd>
<a href="misc/location-info?location=Fort+Pierce%2C+FL" target="_blank"
rel="nofollow noreferrer" itemprop="address" class="concealed">Fort
Pierce, FL</a>
Upvotes: 0
Views: 44
Reputation: 1416
This is a good starting point:
In [18]: for a in response.css('.extraUserInfo'):
    ...:     print(a.css('*::text').extract())
    ...:     print('\n\n\n')
    ...:
['\n', '\n', '\n', '\n'] # <-- this (and other outputs like it) appears because an extra `extraUserInfo` block sits above the desired info block when the user has a user-group picture/avatar below their username
['\n', '\n', 'Joined:', '\n', 'Mar 24, 2013', '\n', '\n', '\n', 'Messages:', '\n', '6,747', '\n', '\n']
['\n', '\n', '\n', '\n']
['\n', '\n', 'Joined:', '\n', 'Mar 24, 2013', '\n', '\n', '\n', 'Messages:', '\n', '6,747', '\n', '\n']
['\n', '\n', 'Joined:', '\n', 'Apr 14, 2013', '\n', '\n', '\n', 'Messages:', '\n', '580', '\n', '\n', '\n', 'Location:', '\n', '\n', 'Fort Pierce, FL', '\n', '\n', '\n']
['\n', '\n', 'Joined:', '\n', 'Oct 20, 2012', '\n', '\n', '\n', 'Messages:', '\n', '2,476', '\n', '\n', '\n', 'Location:', '\n', '\n', 'Philadelphia, PA', '\n', '\n', '\n']
['\n', '\n', 'Joined:', '\n', 'Dec 11, 2012', '\n', '\n', '\n', 'Messages:', '\n', '2,938', '\n', '\n', '\n', 'Location:', '\n', '\n', 'Colorado', '\n', '\n', '\n']
['\n', '\n', 'Joined:', '\n', 'Sep 30, 2016', '\n', '\n', '\n', 'Messages:', '\n', '833', '\n', '\n', '\n', 'Location:', '\n', '\n', 'Indiana', '\n', '\n', '\n']
...
There are many ways to approach this, and a little fiddling around will get the data formatted to your liking. The approach above is only a starting point, because many of the outputs are lists containing nothing but newline characters. That seems to happen because, when a user has a user-group image (like tesla of arizona), the extraUserInfo class is also used to wrap that block of HTML. There will be better ways to group this...
Basically, response.css('.extraUserInfo') selects every block with the class extraUserInfo, which seem to be the blocks holding the user info you're looking for.
From there, extract all the underlying text with the ::text pseudo-selector and parse the resulting lists.
There is definitely a better way to approach this if you look carefully at the HTML structure and extract the data in a way that leaves you less processing work afterwards, but this should get you on the right track. The CSS selector and XPath documentation should be a great help.
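As a rough sketch of the "parse the arrays" step, the text lists printed above could be turned into dictionaries like this. The helper name parse_text_block is made up for illustration, and it assumes that every label ends with a colon and that no value does; the sample lists are copied from the output above.

```python
def parse_text_block(texts):
    # Hypothetical helper: turns one extracted text list into {label: value}.
    # Drop whitespace-only entries and collapse any internal whitespace.
    cleaned = [' '.join(t.split()) for t in texts if t.strip()]
    info = {}
    label = None
    for item in cleaned:
        if item.endswith(':'):
            # Assumption: anything ending in ':' is a label such as 'Joined:'.
            label = item[:-1]
            info[label] = ''
        elif label is not None:
            # Text that follows a label is (part of) that label's value.
            info[label] = (info[label] + ' ' + item).strip()
    return info


# Two sample lists taken from the output above.
blocks = [
    ['\n', '\n', '\n', '\n'],
    ['\n', '\n', 'Joined:', '\n', 'Apr 14, 2013', '\n', '\n', '\n',
     'Messages:', '\n', '580', '\n', '\n', '\n',
     'Location:', '\n', '\n', 'Fort Pierce, FL', '\n', '\n', '\n'],
]

# Newline-only blocks parse to an empty dict, so they are easy to skip.
users = [info for info in map(parse_text_block, blocks) if info]
print(users)
```

The newline-only lists come out as empty dictionaries, so filtering on truthiness drops the extra avatar blocks without any special casing.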
Upvotes: 0
Reputation: 16772
Since the values all lie under <dd> tags, you can use .find_all():
from bs4 import BeautifulSoup
test = '''<dl class="pairsJustified">
<dt>Joined:</dt>
<dd>Apr 14, 2013</dd>
</dl>
<dl class="pairsJustified">
<dt>Messages:</dt>
<dd><a href="search/member?user_id=13302" class="concealed"
rel="nofollow">580</a></dd>
</dl>
<dl class="pairsJustified">
<dt>Location:</dt>
<dd>
<a href="misc/location-info?location=Fort+Pierce%2C+FL" target="_blank"
rel="nofollow noreferrer" itemprop="address" class="concealed">Fort Pierce, FL</a>'''
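soup = BeautifulSoup(test, 'html.parser')
# Each value sits inside a <dd> element, so collect all of them.
data = soup.find_all("dd")
for d in data:
    # .text also picks up text nested in <a>; strip() trims the surrounding newlines.
    print(d.text.strip())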
OUTPUT:
Apr 14, 2013
580
Fort Pierce, FL
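If you also want to know which value is which, one possible variation is to walk each <dl class="pairsJustified"> block and pair its <dt> label with its <dd> value. This is only a sketch: the function name extract_user_info is made up here, and the closing </dd></dl> tags at the end of the sample were added on the assumption that the real page closes them.

```python
from bs4 import BeautifulSoup

# Same markup as above; the final </dd></dl> is an assumed completion.
html = '''<dl class="pairsJustified">
<dt>Joined:</dt>
<dd>Apr 14, 2013</dd>
</dl>
<dl class="pairsJustified">
<dt>Messages:</dt>
<dd><a href="search/member?user_id=13302" class="concealed"
rel="nofollow">580</a></dd>
</dl>
<dl class="pairsJustified">
<dt>Location:</dt>
<dd>
<a href="misc/location-info?location=Fort+Pierce%2C+FL" target="_blank"
rel="nofollow noreferrer" itemprop="address" class="concealed">Fort Pierce, FL</a>
</dd>
</dl>'''


def extract_user_info(markup):
    # Hypothetical helper: maps each <dt> label to the text of its <dd>.
    soup = BeautifulSoup(markup, 'html.parser')
    info = {}
    for dl in soup.find_all('dl', class_='pairsJustified'):
        dt = dl.find('dt')
        dd = dl.find('dd')
        if dt is None or dd is None:
            continue  # skip blocks that do not hold a label/value pair
        label = dt.get_text(strip=True).rstrip(':')
        # Collapse newlines and repeated spaces inside the value.
        value = ' '.join(dd.get_text().split())
        info[label] = value
    return info


user_info = extract_user_info(html)
print(user_info)
```

Keying on the label means the code still works for users who have no Location entry, where relying on the position of the third <dd> would not.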
Upvotes: 1