Smack Alpha

Reputation: 1980

Creating a cache for faster access of a list of dicts in Python

I am writing a Python program that gets the IP address of a website using the socket module. I have a list of dicts containing some number of websites and numbers.

Here's some sample data:

data_list = [{'website': 'www.google.com', 'n': 'n1'}, {'website': 'www.yahoo.com', 'n': 'n2'}, {'website': 'www.bing.com', 'n': 'n3'}, {'website': 'www.stackoverflow.com', 'n': 'n4'}, {'website': 'www.smackcoders.com', 'n': 'n5'}, {'website': 'www.zoho.com', 'n': 'n6'}, {'website': 'www.quora.com', 'n': 'n7'}, {'website': 'www.elastic.co', 'n': 'n8'}, {'website': 'www.google.com', 'n': 'n9'}, {'website': 'www.yahoo.com', 'n': 'n10'}, {'website': 'www.bing.com', 'n': 'n11'}, {'website': 'www.stackoverflow.com', 'n': 'n12'}, {'website': 'www.smackcoders.com', 'n': 'n13'}, {'website': 'www.zoho.com', 'n': 'n14'}, {'website': 'www.quora.com', 'n': 'n15'}, {'website': 'www.elastic.co', 'n': 'n16'}, {'website': 'www.google.com', 'n': 'n17'}, {'website': 'www.yahoo.com', 'n': 'n18'}, {'website': 'www.bing.com', 'n': 'n19'}, {'website': 'www.stackoverflow.com', 'n': 'n20'}]

Here's my program:

import socket
import time


data_list = [{'website': 'www.google.com', 'n': 'n1'}, {'website': 'www.yahoo.com', 'n': 'n2'}, {'website': 'www.bing.com', 'n': 'n3'}, {'website': 'www.stackoverflow.com', 'n': 'n4'}, {'website': 'www.smackcoders.com', 'n': 'n5'}, {'website': 'www.zoho.com', 'n': 'n6'}, {'website': 'www.quora.com', 'n': 'n7'}, {'website': 'www.elastic.co', 'n': 'n8'}, {'website': 'www.google.com', 'n': 'n9'}, {'website': 'www.yahoo.com', 'n': 'n10'}, {'website': 'www.bing.com', 'n': 'n11'}, {'website': 'www.stackoverflow.com', 'n': 'n12'}, {'website': 'www.smackcoders.com', 'n': 'n13'}, {'website': 'www.zoho.com', 'n': 'n14'}, {'website': 'www.quora.com', 'n': 'n15'}, {'website': 'www.elastic.co', 'n': 'n16'}, {'website': 'www.google.com', 'n': 'n17'}, {'website': 'www.yahoo.com', 'n': 'n18'}, {'website': 'www.bing.com', 'n': 'n19'}, {'website': 'www.stackoverflow.com', 'n': 'n20'}]

field = "website"
action = "append"
max_retry = 1
hit_cache_size = 10
cache = []
d1 = []

for data in data_list:
    temp = {}
    for item in data:
        if item == field:
            if data[item] != "Not available":
                try:
                    ad = socket.gethostbyname(data[item])
                    if len(cache) < hit_cache_size:
                        cache.append({data[item]: ad})
                    else:
                        cache = []
                    if action == "replace":
                        temp[item] = ad
                    elif action == "append":
                        temp[item] = str([data[item], ad])
                except socket.gaierror:
                    count = 0
                    while True:
                        try:
                            ad = socket.gethostbyname(data[item])
                        except socket.gaierror:
                            count += 1
                            if count == max_retry:
                                if action == "replace":
                                    temp[item] = "Unknown"
                                elif action == "append":
                                    temp[item] = str([data[item], "Unknown"])
                                break
                        else:
                            if action == "replace":
                                temp[item] = ad
                            elif action == "append":
                                temp[item] = str([data[item], ad])
                            break
            else:
                temp[item] = "Not available"
        else:
            temp[item] = data[item]
    temp['timestamp'] = time.ctime()
    d1.append(temp)
print(d1)

Here, data_list can have millions of websites, so my code takes a long time. To speed it up, I created a cache to store some websites with their IPs; its size is defined by hit_cache_size. When the same website address appears again in the list, the program should first check the cache instead of calling the socket module, and if the website is there, take the IP from the cache. I tried some approaches using arrays, but they still take too long. How can I make this work?
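A dict keyed by hostname makes this straightforward, since dict lookups are O(1) on average. Below is a minimal sketch of the loop above with the list-based cache replaced by a dict; the retry logic is dropped for brevity and the sample data is shortened:

```python
import socket
import time

data_list = [{'website': 'www.google.com', 'n': 'n1'},
             {'website': 'www.yahoo.com', 'n': 'n2'},
             {'website': 'www.google.com', 'n': 'n3'}]

field = "website"
action = "append"
cache = {}  # hostname -> resolved IP (or "Unknown")


def resolve(host):
    """Return the cached IP if present; otherwise resolve once and cache it."""
    if host not in cache:
        try:
            cache[host] = socket.gethostbyname(host)
        except OSError:
            cache[host] = "Unknown"
    return cache[host]


d1 = []
for data in data_list:
    temp = dict(data)
    if temp.get(field) and temp[field] != "Not available":
        ad = resolve(temp[field])
        temp[field] = ad if action == "replace" else str([temp[field], ad])
    temp['timestamp'] = time.ctime()
    d1.append(temp)
print(d1)
```

Each distinct hostname is resolved at most once; the duplicate www.google.com entry is served from the dict.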

Upvotes: 1

Views: 1740

Answers (2)

vdkotian

Reputation: 559

You mentioned that you could have millions of websites, so one way to handle this is to use a framework specialized in caching. One such example is Redis.

Installing and getting started with redis

Below is a small sample showing how to SET and GET data.

import redis

# step 2: define our connection information for Redis
# Replace with your configuration information
redis_host = "localhost"
redis_port = 6379
redis_password = ""


def hello_redis():
    """Example Hello Redis Program"""

    # step 3: create the Redis Connection object
    try:

        # The decode_responses flag here directs the client to convert the responses from Redis into Python strings
        # using the default encoding utf-8.  This is client specific.
        r = redis.StrictRedis(host=redis_host, port=redis_port, password=redis_password, decode_responses=True)

        # step 4: Set the hello message in Redis 
        r.set("msg:hello", "Hello Redis!!!")

        # step 5: Retrieve the hello message from Redis
        msg = r.get("msg:hello")
        print(msg)        

    except Exception as e:
        print(e)


if __name__ == '__main__':
    hello_redis()

Now, using the above example, you can implement it in your codebase. Below is an example I have written that you can plug in with minimal changes.

def operate_on_cache(operation, **value):
    """Operate on the Redis cache"""
    try:

        # The decode_responses flag here directs the client to convert the responses from Redis into Python strings
        # using the default encoding utf-8.  This is client specific.
        r = redis.StrictRedis(host=redis_host, port=redis_port, password=redis_password, decode_responses=True)

        # Set the key/value pair
        if operation == 'set':
            msg = r.set("{}:ip".format(value['site_name']), value['ip'])

        # Retrieve the key
        elif operation == 'get':
            msg = r.get('{}:ip'.format(value['site_name']))

        # Anything other than get/set is an error
        else:
            raise ValueError("operation must be 'get' or 'set'")
        return msg
    except Exception as e:
        print(e)


# Snippet showing how you could plug it into your code: check the cache
# first, and only resolve (and store) on a cache miss.


if data[item] != "Not available":
    ad = operate_on_cache('get', site_name=data[item])
    if ad is None:  # cache miss: resolve and store
        ad = socket.gethostbyname(data[item])
        operate_on_cache('set', site_name=data[item], ip=ad)

This is just the basics of how you could make use of Redis for caching. If you are looking for a pure-Python implementation, try

cachetools Example of cachetools
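If pulling in a third-party package is not an option, the standard library's functools.lru_cache gives similar memoization. A sketch (note this has no TTL and, as written, also caches failed lookups; the `.invalid` TLD is reserved and never resolves):

```python
import functools
import socket


@functools.lru_cache(maxsize=1024)
def resolve(host):
    """Resolve host once; repeated calls for the same host hit the cache."""
    try:
        return socket.gethostbyname(host)
    except OSError:
        return "Unknown"


first = resolve("name-that-does-not-resolve.invalid")
second = resolve("name-that-does-not-resolve.invalid")  # served from the cache
print(resolve.cache_info())
```

Unlike cachetools' TTLCache, lru_cache entries never expire, so stale DNS results can linger for the life of the process.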

Upvotes: 1

uphill

Reputation: 409

In general, a cache should use a data structure that is quicker than a list. A list lookup will, in the worst case, take as many iterations as there are entries (n); take a look at https://wiki.python.org/moin/TimeComplexity .

E.g., looking up the mapping for 'c' here takes 3 iterations:

entries = [('a', 1), ('b', 2), ('c', 3)]
result = None
for key, val in entries:
    if key == 'c':
        result = val
        break
print(result)

If you want to speed up access to a cache, use a Python dict. Lookups in a dict run in O(1) on average rather than O(n), which is much better. Nice side effect: it is much easier to read as well.

entries = {'a': 1, 'b': 2, 'c': 3}
result = entries['c']
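The difference is easy to measure with the timeit module. Here the worst case for the list (scanning to the last key) is compared against the same lookup in a dict:

```python
import timeit

# Build 10,000 key/value pairs as both a list of tuples and a dict.
entries_list = [(str(i), i) for i in range(10_000)]
entries_dict = dict(entries_list)


def list_lookup(key):
    """Linear scan, as in the list-based example above."""
    for k, v in entries_list:
        if k == key:
            return v


# Look up the last key, the worst case for the linear scan.
list_time = timeit.timeit(lambda: list_lookup('9999'), number=1_000)
dict_time = timeit.timeit(lambda: entries_dict['9999'], number=1_000)
print(f"list: {list_time:.4f}s  dict: {dict_time:.4f}s")
```

On any Python implementation the dict lookup is orders of magnitude faster, and the gap grows with the number of entries.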

Upvotes: 1
