bladexeon

Reputation: 706

Reading URL socket backwards in Python

I'm attempting to pull information from a log file posted online and read through the output. The only information I really need is at the end of the file. These files are pretty big, and storing the entire socket output in a variable and reading through it consumes a lot of memory. Is there a way to read the socket from bottom to top?

What I currently have:

socket = urllib.urlopen(urlString)
OUTPUT = socket.read()
socket.close()
OUTPUT = OUTPUT.split("\n")
for line in OUTPUT:
    if "xxxx" in line:
        print line

I am using Python 2.7. I pretty much want to read about 30 lines from the very end of the socket's output.

Upvotes: 2

Views: 146

Answers (1)

Alex G Rice

Reputation: 1579

What you want in this use case is an HTTP Range request. Here is a tutorial I found:

http://stuff-things.net/2015/05/13/web-scale-http-tail/

I should clarify: the advantage of getting the size with a HEAD request and then doing a Range request is that you do not have to transfer all the content. You mentioned you have pretty big files, so this is going to be the best solution :)

edit: added this code below...

Here is a demo (simplified) of that blog article, but translated into Python. Please note this will not work with all HTTP servers! More comments inline:

"""
illustration of how to 'tail' a file using http. this will not work on all
webservers! if you need an http server to test with you can try the
rangehttpserver module:

$ pip install requests
$ pip install rangehttpserver
$ python -m RangeHTTPServer
"""
import requests

TAIL_SIZE = 1024

url = 'http://localhost:8000/lorem-ipsum.txt'
response = requests.head(url)

# not all servers return content-length in head, for some reason
assert 'content-length' in response.headers, 'Content length unknown- out of luck!'

# check the resource length and construct a request header for that range
# (byte ranges are zero-indexed and inclusive, so the last byte is full_length - 1)
full_length = int(response.headers['content-length'])
assert full_length > TAIL_SIZE
headers = {
  'range': 'bytes={}-{}'.format(full_length - TAIL_SIZE, full_length - 1)
}

# Make a get request, with the range header
response = requests.get(url, headers=headers)
# a server that honors the range replies with 206 Partial Content
assert response.status_code == 206, 'Server ignored the range header'
assert 'accept-ranges' in response.headers, 'Accept-ranges response header missing'
assert response.headers['accept-ranges'] == 'bytes'
# compare raw bytes, not decoded text: content-length counts bytes
assert len(response.content) == TAIL_SIZE

# Otherwise you get the entire file
response = requests.get(url)
assert len(response.content) == full_length
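
To get from the tail bytes to the roughly 30 lines the question asks for, the chunk still has to be split into lines. The following is a minimal sketch of that step, not part of the blog article: the helper name `last_lines` and its parameters are made up for illustration. It assumes the chunk came from a Range request and therefore probably starts in the middle of a line, so the first fragment is dropped.

```python
def last_lines(chunk, count=30, starts_mid_line=True):
    """Return up to `count` complete trailing lines from a chunk of bytes.

    `chunk` is the raw body of the ranged response (hypothetical helper,
    the name and signature are assumptions for this sketch).
    """
    if count <= 0:
        return []
    lines = chunk.splitlines()
    # A ranged chunk usually begins partway through a line, so the first
    # entry is likely a fragment and is discarded.
    if starts_mid_line and lines:
        lines = lines[1:]
    return lines[-count:]


if __name__ == '__main__':
    sample = b"rtial entry\nline a\nline b\nline c\n"
    for line in last_lines(sample, count=2):
        print(line)
```

You would call it on the body of the ranged GET request (the one sent with the range header), for example `last_lines(response.content)`. If fewer than 30 lines come back, TAIL_SIZE was too small for this log and needs to be increased.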

Upvotes: 2
