Shukuang Chen

Reputation: 23

Python crawler for IEEE paper keywords

I am trying to use a crawler to get IEEE paper keywords, but now I get an error. How can I fix my crawler? My code is here:

import requests
import json
import re
from bs4 import BeautifulSoup
ieee_content = requests.get("http://ieeexplore.ieee.org/document/8465981", timeout=180)
soup = BeautifulSoup(ieee_content.text, 'xml')
tag = soup.find_all('script')
for i in tag[9]:
    s = json.loads(re.findall('global.document.metadata=(.*;)', i)[0].replace("'", '"').replace(";", ''))

And the error is here:

Traceback (most recent call last):
  File "G:/github/爬蟲/redigg-leancloud/crawlers/sup_ieee_keywords.py", line 90, in <module>
    a.get_es_data(offset=0, size=1)
  File "G:/github/爬蟲/redigg-leancloud/crawlers/sup_ieee_keywords.py", line 53, in get_es_data
    self.get_data(link=ieee_link, esid=es_id)
  File "G:/github/爬蟲/redigg-leancloud/crawlers/sup_ieee_keywords.py", line 65, in get_data
    s = json.loads(re.findall('global.document.metadata=(.*;)', i)[0].replace(";", '').replace("'", '"'))
IndexError: list index out of range

Upvotes: 1

Views: 618

Answers (2)

Life is complex

Reputation: 15619

Here's another answer. I don't know what you do with 's' after the load in your code (after the replace in mine).

The code below doesn't throw an error, but again, how are you using 's'?

import requests
import json
import re
from bs4 import BeautifulSoup

ieee_content = requests.get("http://ieeexplore.ieee.org/document/8465981", timeout=180)
soup = BeautifulSoup(ieee_content.text, 'xml')
tag = soup.find_all('script')

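# compile the pattern once, outside the loop
metadata_format = re.compile(r'global\.document\.metadata=.*', re.MULTILINE)

# each i is a child node (the script text) of the tenth <script> tag
for i in tag[9]:
   metadata = re.findall(metadata_format, i)
   # findall returns an empty list when nothing matches, so check before indexing
   if len(metadata) != 0:
      # round-trip the list of matches through JSON (this leaves it unchanged)
      convert_to_json = json.dumps(metadata)
      x = json.loads(convert_to_json)
      s = x[0].replace("'", '"').replace(";", '')
      ###########################################
      # I don't know what you plan to do with 's'
      ###########################################
      print(s)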
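
If the goal is the keywords themselves, one possible next step with 's' is sketched below. The sample string and the layout of the "keywords" entry (a list of groups, each with a "kwd" list) are assumptions for illustration; check them against what the page actually returns.

```python
import json

# Hypothetical sample of 's' after the quote/semicolon cleanup above;
# the real page's metadata layout may differ.
s = ('global.document.metadata={"title": "Sample paper", "keywords": '
     '[{"type": "IEEE Keywords", "kwd": ["Crawlers", "Metadata"]}]}')

# The pattern above has no capture group, so 's' still starts with
# 'global.document.metadata='. Keep only the part after the first '='.
_, _, payload = s.partition('=')
metadata = json.loads(payload)

# Flatten every keyword group into one list; .get() avoids a KeyError
# when an entry is missing.
keywords = [
    kwd
    for group in metadata.get("keywords", [])
    for kwd in group.get("kwd", [])
]
print(keywords)
```

Note that replacing every ' with " can break the JSON if a title or abstract contains an apostrophe, so it may be worth trying json.loads on the raw payload first.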

Upvotes: 1

Regis May

Reputation: 3446

Apparently, at line 65, some of the data in i did not match the regex pattern you're trying to use. re.findall then returns an empty list, so your [0] fails with an IndexError.
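
A minimal sketch of that failure, using two made-up script strings (not taken from the real page) to show what re.findall returns in each case:

```python
import re

pattern = r'global\.document\.metadata=(.*;)'

# Hypothetical script contents for illustration only.
matching = 'global.document.metadata={"title": "Sample"};'
other = 'var unrelated = 1;'

found = re.findall(pattern, matching)   # one captured group
missing = re.findall(pattern, other)    # no match, so an empty list
print(found)
print(missing)

try:
    missing[0]
except IndexError as exc:
    # Same error as in the traceback from the question.
    error_message = str(exc)
    print(error_message)
```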

Solution:

x = re.findall('global.document.metadata=(.*;)', i)
if x:
    s = json.loads(x[0].replace("'", '"').replace(";", ''))
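
The same guard can be wrapped in a small helper that checks every script text instead of relying on a fixed index such as tag[9]. This is only a sketch: find_metadata is a hypothetical name, and it assumes the assigned value is already valid JSON (if the page uses single quotes, the replace step above would still be needed).

```python
import json
import re

# Capture the object assigned to global.document.metadata on a single line.
METADATA_RE = re.compile(r'global\.document\.metadata=(\{.*\});')


def find_metadata(script_texts):
    """Return the first metadata dict found in script_texts, or None."""
    for text in script_texts:
        match = METADATA_RE.search(text)
        if match is None:
            continue  # this script does not contain the metadata
        try:
            return json.loads(match.group(1))
        except json.JSONDecodeError:
            continue  # matched, but the payload is not valid JSON
    return None


# Hypothetical script contents for illustration only.
scripts = ['var unrelated = 1;', 'global.document.metadata={"title": "Sample"};']
result = find_metadata(scripts)
empty = find_metadata(['var unrelated = 1;'])
print(result)
print(empty)
```

With BeautifulSoup, the input could be built as (tag.string or '' for tag in soup.find_all('script')), since tag.string may be None for some tags.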

Upvotes: 0
