Roy Wu

Reputation: 71

Python BeautifulSoup: text has no tag and the selector returns empty

I need to extract the text content shown above, but in the HTML it is not wrapped in a tag of its own that I could select. As a test I tried selecting div.main-content, but the result comes back empty. Why?

import urllib
import urllib.request
import requests
from bs4 import BeautifulSoup
import re

links = ['https://www.ptt.cc/bbs/Gossiping/index'+str(i+1)+'.html' for i in range(15156)]

twice_link = []
data = {"yes":"yes"}
head = {"User-Agent":"Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:47.0) Gecko/20100101 Firefox/47.0"}
for ind, link in enumerate(links, 0):
    with requests.Session() as s:
        data["from"] = "/bbs/Gossiping/index{}.html".format(ind)
        s.post("https://www.ptt.cc/ask/over18", data=data, headers=head)
        res = s.get(link, headers=head)
        soup = BeautifulSoup(res.content,"html.parser")
        data_title = soup.select("div.title")
        data_date = soup.select("div.date")  #Date
        data_author =soup.select("div.author")  #Author
        data_times = soup.select("div.nrec")  #Count
    for x in range(1, 20):     
        url = data_title[x].find('a').get('href')[15:33] 
        data_link = ['https://www.ptt.cc/bbs/Gossiping/'+str(url)+'.html'] # URL to visit; a second request is needed here

        for ind, link2 in enumerate(data_link, 0):
            with requests.Session() as s:
                data["from"] = "/bbs/Gossiping/index{}.html".format(ind)
                s.post("https://www.ptt.cc/ask/over18", data=data, headers=head)
                res = s.get(link2, headers=head)
                soup2 = BeautifulSoup(res.content,"html.parser")
                body = soup2.find_all(text=True)
                twice_title = soup2.select("span.article-meta-value")
                twice_data1 = soup2.select_one("#main-content")
                print(twice_title[0].get_text())

Upvotes: 1

Views: 1714

Answers (1)

Padraic Cunningham

Reputation: 180441

The div has an id of main-content, not a class, so use an id selector:

twice_data1 = soup2.select_one("#main-content")

Once you make that change, you will get your div.
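
To see the difference without hitting the site, here is a minimal sketch on a made-up HTML fragment (the markup is an assumption for illustration, not the real PTT page). A `.main-content` selector only matches a class attribute, while `#main-content` matches an id:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: the div carries an id, not a class named main-content.
html = '<div id="main-content" class="bbs-screen">article text</div>'
soup = BeautifulSoup(html, "html.parser")

by_class = soup.select_one("div.main-content")  # class selector: nothing matches
by_id = soup.select_one("#main-content")        # id selector: matches the div

print(by_class)          # None
print(by_id.get_text())  # article text
```

The same applies to select, which returns an empty list rather than None when nothing matches.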

You should also not define your functions repeatedly inside the loop, and, as per the second part of my answer to your previous question, you only need to post once to confirm your age:

import requests
from bs4 import BeautifulSoup
import re


def visible(element):
    if element.parent.name in ['style', 'script', '[document]', 'head', 'title', 'span']:
        return False
    elif re.match('<!--.*-->', str(element)):
        return False
    return True


data_link = ['https://www.ptt.cc/bbs/Gossiping/M.1119257927.A.60D.html' for i in range(1)]  # URL to visit; a second request is needed here.
data = {"yes": "yes"}
head = {"User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:47.0) Gecko/20100101 Firefox/47.0"}

with requests.Session() as s:
    data["from"] = "/bbs/Gossiping/index1.html"
    s.post("https://www.ptt.cc/ask/over18", data=data, headers=head)
    for link in data_link:
        res = s.get(link, headers=head)
        soup2 = BeautifulSoup(res.content, "html.parser")
        body = soup2.find_all(text=True)
        twice_title = soup2.select("span.article-meta-value")
        print(twice_title)
        twice_data1 = soup2.select_one("#main-content")
        print(twice_data1)

I also presume that in your actual code you use str.format in the data_link list comprehension to create multiple URLs over a range greater than 1.
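
As a rough sketch of what such a comprehension might look like, assuming you already have the article ids collected in a list (`article_ids` is a hypothetical name; the ids are the three articles used in this answer):

```python
# Hypothetical list of article ids, e.g. scraped from the index pages.
article_ids = ["M.1119257927.A.60D", "M.1467636128.A.23A", "M.1467638164.A.524"]

# Build one article URL per id with str.format.
data_link = ["https://www.ptt.cc/bbs/Gossiping/{}.html".format(a) for a in article_ids]

for link in data_link:
    print(link)
```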

To get the specific text, you can get the last span with the f6 class and find the following sibling text:

with requests.Session() as s:
    data["from"] = "/bbs/Gossiping/index1.html"
    s.post("https://www.ptt.cc/ask/over18", data=data, headers=head)
    for link in data_link:
        res = s.get(link, headers=head)
        soup2 = BeautifulSoup(res.content, "html.parser")
        body = soup2.find_all(text=True)
        twice_title = soup2.select("span.article-meta-value")
        twice_data1 = soup2.select_one("#main-content")
        print(twice_data1.find_all("span",class_="f6")[-1].find_next_sibling(text=True))

Which will give you the specific text. If you look at the page source, you can see that the text sits between the last span.f6 and the next span.f2.
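
The sibling lookup can be tried offline on a small fragment. This is only a sketch: the HTML below is invented to mimic the layout described, and it uses `string=True`, the newer spelling of the `text=True` argument (available since Beautiful Soup 4.4):

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: bare text sitting between the last span.f6 and a span.f2.
html = (
    '<div id="main-content">'
    '<span class="f6">quoted line 1</span>'
    '<span class="f6">quoted line 2</span>'
    'the reply text'
    '<span class="f2">footer</span>'
    '</div>'
)
soup = BeautifulSoup(html, "html.parser")
main = soup.select_one("#main-content")

# Take the last span.f6, then the first text node that follows it as a sibling.
last_f6 = main.find_all("span", class_="f6")[-1]
reply = last_f6.find_next_sibling(string=True)

print(reply)  # the reply text
```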

Since, judging by the second link, the span.f2 seems to be consistent, we can use that instead:

twice_data1.find("span",class_="f2").find_previous_sibling(text=True)

OK, as there can be a newline before the f2, we can instead use twice_data1.find_all(text=True, recursive=False) to get the text from the div itself:

data_link = ['https://www.ptt.cc/bbs/Gossiping/M.1119257927.A.60D.html', "https://www.ptt.cc/bbs/Gossiping/M.1467636128.A.23A.html","https://www.ptt.cc/bbs/Gossiping/M.1467638164.A.524.html"]  # URL to visit; a second request is needed here.
data = {"yes": "yes"}
head = {"User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:47.0) Gecko/20100101 Firefox/47.0"}
with requests.Session() as s:
    data["from"] = "/bbs/Gossiping/index1.html"
    s.post("https://www.ptt.cc/ask/over18", data=data, headers=head)
    for link in data_link:
        res = s.get(link, headers=head)
        soup2 = BeautifulSoup(res.content, "html.parser")
        body = soup2.find_all(text=True)
        twice_title = soup2.select("span.article-meta-value")
        twice_data1 = soup2.select_one("#main-content")
        print("".join(twice_data1.find_all(text=True, recursive=False)))
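
To illustrate why the direct-children approach is more robust, here is a small offline sketch. The fragment is made up (not the real page) and uses `string=True`, the newer name for `text=True`. When another element and a newline sit before the span.f2, the previous-sibling lookup only returns the newline, while collecting the div's own text nodes still picks up the body:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: a link and a newline separate the body text from span.f2.
html = (
    '<div id="main-content">'
    '<span class="f6">quoted</span>'
    'reply text '
    '<a href="#">a link</a>'
    '\n'
    '<span class="f2">footer</span>'
    '</div>'
)
soup = BeautifulSoup(html, "html.parser")
main = soup.select_one("#main-content")

# The nearest text sibling before span.f2 is just the newline.
prev = main.find("span", class_="f2").find_previous_sibling(string=True)

# recursive=False keeps only text nodes that are direct children of the div,
# so text nested inside the spans and the link is skipped.
body_text = "".join(main.find_all(string=True, recursive=False))

print(repr(prev))       # '\n'
print(repr(body_text))  # 'reply text \n'
```

Note that text nested in child tags (here the link text) is left out by design; call strip() on the result if the surrounding whitespace is not wanted.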

Upvotes: 1
