salvu

Reputation: 519

Python replace() throwing errors when replacing "</html>"

I am very new to Python and I'm trying to understand and use the script from this link in Anaconda, running on Python 3.5.2. I have had to change some things so that the script can run in this version of Python, since it is from 2013. The script (as amended by inexperienced me) is below, and my problem is with the line html = f.read().replace("</html>", "") + "</html>" in the try block.

I simply cannot understand the reason for the + "</html>" that comes after the closing parenthesis. From what I have found out, the replace() method takes at least two parameters: the old character(s) and the new ones. As it is, this script jumps to the except Exception as e: branch and prints out a bytes-like object is required, not 'str'.
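
For reference, the error can be reproduced outside the script; the data variable and its value below are just made-up stand-ins for what urllib's f.read() returns:

data = b"<html><body>hi</body></html>"   # f.read() returns bytes like this
data.replace("</html>", "")              # TypeError: a bytes-like object is required, not 'str'
data.replace(b"</html>", b"")            # fine: bytes arguments for a bytes object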

Now this is, as far as I can tell, because the reading is done as bytes whereas the replace() method takes strings. I tried to split the line into:

html = f.read
html = str.replace("</html>", "") + "</html>"

but this throws replace() takes at least 2 arguments (1 given). I also tried converting the contents of html from bytes to str as follows:

html = str(f.read(), 'utf-8')
html = str.replace("</html>", "")

but this also returns the error that replace() takes at least 2 arguments (1 given). When I removed the html = str.replace("</html>", "") + "</html>" line altogether and skipped straight to soup = BeautifulSoup(html), I ended up with a warning that no parser was explicitly specified, and later on an AttributeError saying that a NoneType object has no attribute get_dictionary.
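
(As far as I can tell, the warning itself only asks for a parser to be named explicitly, for example:

soup = BeautifulSoup(html, "html.parser")

but that alone does not explain the AttributeError.)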

Any help on why the mentioned line is needed, what it is used for, and how to use it would be greatly appreciated. Thank you.

#!/usr/bin/python

import sys
import urllib.request
import re
import json

from bs4 import BeautifulSoup

import socket

socket.setdefaulttimeout(10)

cache = {}

for line in open(sys.argv[1]):
    fields = line.rstrip('\n').split('\t')
    sid = fields[0]
    uid = fields[1]

    # url = 'http://twitter.com/%s/status/%s' % (uid, sid)
    # print url
    tweet = None
    text = "Not Available"
    if sid in cache:
        text = cache[sid]
    else:
        try:
            f = urllib.request.urlopen("http://twitter.com/%s/status/%s" % (uid, sid))
            print('URL: ', f.geturl())
            # Thanks to Arturo!
            # html = f.read()
            html = f.read().replace("</html>", "") + "</html>"
            soup = BeautifulSoup(html)
            jstt = soup.find_all("p", "js-tweet-text")
            tweets = list(set([x.get_text() for x in jstt]))
            # print len(tweets)
            # print tweets
            if (len(tweets)) > 1:
                continue

            text = tweets[0]
            cache[sid] = tweets[0]

            for j in soup.find_all("input", "json-data", id="init-data"):
                js = json.loads(j['value'])
                if js.has_key("embedData"):
                    tweet = js["embedData"]["status"]
                    text = js["embedData"]["status"]["text"]
                    cache[sid] = text
                    break
        except Exception as e:
            print(e)
            # except Exception as e:
            continue

        if tweet is not None and tweet["id_str"] != sid:
            text = "Not Available"
            cache[sid] = "Not Available"
        text = text.replace('\n', ' ', )
        text = re.sub(r'\s+', ' ', text)
        # print json.dumps(tweet, indent=2)
        print("\t".join(fields + [text]).encode('utf-8'))

Upvotes: 1

Views: 593

Answers (2)

salvu

Reputation: 519

I have accepted the solution kindly given by @DeepSpace as the answer, as it helped me realise how to overcome the problem I was facing. The code below now runs under Python 3 when executed from the command prompt as follows (please note that I ran this from the Windows command prompt):

python download_tweets.py input_file.tsv > output_file.tsv. The code follows:

#!/usr/bin/python

import sys
import urllib.request
import re
import json

from bs4 import BeautifulSoup

import socket

socket.setdefaulttimeout(10)

cache = {}

for line in open(sys.argv[1]):
    fields = line.rstrip('\n').split('\t')
    sid = fields[0]
    uid = fields[1]

    tweet = None
    text = "Not Available"
    if sid in cache:
        text = cache[sid]
    else:
        try:
            f = urllib.request.urlopen("http://twitter.com/%s/status/%s" % (uid, sid))
            # print('URL: ', f.geturl())
            # Thanks to Arturo!
            html = str.replace(str(f.read(), 'utf-8'), "</html>", "")
            # html = f.read().replace("</html>", "") + "</html>" # original line
            soup = BeautifulSoup(html, "lxml")  # added "lxml" as it was giving warnings
            jstt = soup.find_all("p", "js-tweet-text")
            tweets = list(set([x.get_text() for x in jstt]))
            # print(len(tweets))
            if (len(tweets)) > 1:
                continue

            text = tweets[0]
            cache[sid] = tweets[0]

            for j in soup.find_all("input", "json-data", id="init-data"):
                js = json.loads(j['value'])
                if "embedData" in js:
                    # if js.has_key("embedData"): # original line
                    tweet = js["embedData"]["status"]
                    text = js["embedData"]["status"]["text"]
                    cache[sid] = text
                    break
        except Exception as e:
            print(e)
            continue

        if tweet is not None and tweet["id_str"] != sid:
            text = "Not Available"
            cache[sid] = "Not Available"
        text = text.replace('\n', ' ', )
        text = re.sub(r'\s+', ' ', text)
        # print(json.dumps("dump: ", tweet, indent=2))
        print(" \t ".join(fields + [text]).encode('utf-8'))
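
As a side note, the same conversion can also be written with the bound methods, which may be easier to read. Here is a small self-contained sketch; the raw value is just a made-up stand-in for what f.read() returns:

raw = b"<html><body><p>hello</p></body></html>"   # stand-in for f.read(), which gives bytes
html = raw.decode('utf-8')                        # bytes -> str
html = html.replace("</html>", "")                # bound str.replace, two arguments
print(html)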

Upvotes: 0

DeepSpace

Reputation: 81614

str.replace here is being used in its unbound form (calling the method on the class str itself instead of on a str object).

Used that way, str.replace actually needs 3 arguments: the string to act on, the char or string to replace, and the new char or string.

'abcd'.replace('d', 'z') is equivalent to str.replace('abcd', 'd', 'z'):

print('abcd'.replace('d', 'z'))
# abcz
print(str.replace('abcd', 'd', 'z'))
# abcz

Upvotes: 2
