rdell
rdell

Reputation: 35

handling 900 000 items in a list "quickly" when needed to access it multiple times?

I've got a use case where I'm pulling in "links" to files on a server file share.

I then need to run some regex checks on these links and break them out into specific pieces so I can sort them.

Once sorted I need to split the list among x amount of servers to start pulling them. The sorting is important as each split needs to be even.

 ['/local/custom_name/a_database/database_name_56_1118843/file_1.tgz']
 ['/local/custom_name/a_database/database_name_56_1118843/file_3.tgz']
 ['/local/custom_name/a_database/database_name_56_1118843/file_4.tgz']
 ['/local/custom_name/a_database/database_name_56_1118843/file_2.tgz']

 ['/local/custom_name/a_database/database_name_655_1118843/file_1.tgz']
 ['/local/custom_name/a_database/database_name_655_1118843/file_2.tgz']
 ['/local/custom_name/a_database/database_name_655_1118843/file_3.tgz']
 ['/local/custom_name/a_database/database_name_655_1118843/file_4.tgz']
 
 ['/local/custom_name_4/b_database/database_name_5242_11132428843/file_1.tgz']
 ['/local/custom_name_4/b_database/database_name_5242_11132428843/file_2.tgz']

 ['/shared/custom_name/c_database/database_name_56_1118843/file_1.tgz']
 ['/shared/custom_name/c_database/database_name_56_1118843/file_2.tgz']
 
 ['/local/custom_name_4/c_database/database_name_58_1118843/file_1.tgz']
 ['/local/custom_name/ac_database/database_name_58_1118843/file_2.tgz']

For example since there are 8 files in a_database and and 4 per name, then say 4 servers, I'd need one file from each path to go to each server.

What I did was look at each link and then break out the path into a dictionary where the first value is a unique id:

{'uid' : 'local_custom_name_a_database_database_name_56', 'link_list': [] }

Then after I go through the original list again and add any links that fit to the dict:

{'uid' : 'local_custom_name_a_database_database_name_56', 'link_list': [ 
'/local/custom_name/a_database/database_name_56_1118843/file_1.tgz', 
'/local/custom_name/a_database/database_name_56_1118843/file_3.tgz', 
'/local/custom_name/a_database/database_name_56_1118843/file_4.tgz',
'/local/custom_name/a_database/database_name_56_1118843/file_2.tgz'
]}

Then split the link_list among the servers.

All of this works as intended however the second part, where I compare the original link to the new dictionary uid and add the link to the list takes forever. With 10 000 items it takes a couple mins, but with 900 000 items, it looks like it'll take around 125 hours. which isn't ok.

The real data is more complicated and there's significant sorting going on, but that isn't the bottle neck. The bottleneck is where described. While the logic works I'm certain I'm not doing this in the most efficient way.

Any help is appreciated. Even just pointing me in the direction of a better way to handle this many items outside of native lists and lists of lists or dicts.

Upvotes: 2

Views: 84

Answers (1)

iz_
iz_

Reputation: 16593

If performance is a concern, this type of data structure {'uid' : 'local_custom_name_a_database_database_name_56', 'link_list': [] } is going to be a problem. It's O(n) to find an element based on UID. Instead, you need a dictionary mapping the UID directly to the link list. This allows O(1) access. If needed, you can transform the data later.

I don't know the exact logic behind getting the UIDs, so I just have an example one:

l = [
    '/local/custom_name/a_database/database_name_56_1118843/file_1.tgz',
    '/local/custom_name/a_database/database_name_56_1118843/file_3.tgz',
    '/local/custom_name/a_database/database_name_56_1118843/file_4.tgz',
    '/local/custom_name/a_database/database_name_56_1118843/file_2.tgz',
    '/local/custom_name/a_database/database_name_655_1118843/file_1.tgz',
    '/local/custom_name/a_database/database_name_655_1118843/file_2.tgz',
    '/local/custom_name/a_database/database_name_655_1118843/file_3.tgz',
    '/local/custom_name/a_database/database_name_655_1118843/file_4.tgz',
    '/local/custom_name_4/b_database/database_name_5242_11132428843/file_1.tgz',
    '/local/custom_name_4/b_database/database_name_5242_11132428843/file_2.tgz',
    '/shared/custom_name/c_database/database_name_56_1118843/file_1.tgz',
    '/shared/custom_name/c_database/database_name_56_1118843/file_2.tgz',
    '/local/custom_name_4/c_database/database_name_58_1118843/file_1.tgz',
    '/local/custom_name/ac_database/database_name_58_1118843/file_2.tgz',
]

def getUid(s):
    return s[1:].rpartition("/")[0].rpartition("_")[0].replace("/", "_")

result = {}
for s in l:
    result.setdefault(getUid(s), []).append(s)

print(result)
{'local_custom_name_a_database_database_name_56': ['/local/custom_name/a_database/database_name_56_1118843/file_1.tgz',
  '/local/custom_name/a_database/database_name_56_1118843/file_3.tgz',
  '/local/custom_name/a_database/database_name_56_1118843/file_4.tgz',
  '/local/custom_name/a_database/database_name_56_1118843/file_2.tgz'],
 'local_custom_name_a_database_database_name_655': ['/local/custom_name/a_database/database_name_655_1118843/file_1.tgz',
  '/local/custom_name/a_database/database_name_655_1118843/file_2.tgz',
  '/local/custom_name/a_database/database_name_655_1118843/file_3.tgz',
  '/local/custom_name/a_database/database_name_655_1118843/file_4.tgz'],
 'local_custom_name_4_b_database_database_name_5242': ['/local/custom_name_4/b_database/database_name_5242_11132428843/file_1.tgz',
  '/local/custom_name_4/b_database/database_name_5242_11132428843/file_2.tgz'],
 'shared_custom_name_c_database_database_name_56': ['/shared/custom_name/c_database/database_name_56_1118843/file_1.tgz',
  '/shared/custom_name/c_database/database_name_56_1118843/file_2.tgz'],
 'local_custom_name_4_c_database_database_name_58': ['/local/custom_name_4/c_database/database_name_58_1118843/file_1.tgz'],
 'local_custom_name_ac_database_database_name_58': ['/local/custom_name/ac_database/database_name_58_1118843/file_2.tgz']}

Then, if needed:

transformed = [{"uid": k, "link_list": v} for k, v in result.items()]
print(transformed)
[{'uid': 'local_custom_name_a_database_database_name_56',
  'link_list': ['/local/custom_name/a_database/database_name_56_1118843/file_1.tgz',
   '/local/custom_name/a_database/database_name_56_1118843/file_3.tgz',
   '/local/custom_name/a_database/database_name_56_1118843/file_4.tgz',
   '/local/custom_name/a_database/database_name_56_1118843/file_2.tgz']},
 {'uid': 'local_custom_name_a_database_database_name_655',
  'link_list': ['/local/custom_name/a_database/database_name_655_1118843/file_1.tgz',
   '/local/custom_name/a_database/database_name_655_1118843/file_2.tgz',
   '/local/custom_name/a_database/database_name_655_1118843/file_3.tgz',
   '/local/custom_name/a_database/database_name_655_1118843/file_4.tgz']},
 {'uid': 'local_custom_name_4_b_database_database_name_5242',
  'link_list': ['/local/custom_name_4/b_database/database_name_5242_11132428843/file_1.tgz',
   '/local/custom_name_4/b_database/database_name_5242_11132428843/file_2.tgz']},
 {'uid': 'shared_custom_name_c_database_database_name_56',
  'link_list': ['/shared/custom_name/c_database/database_name_56_1118843/file_1.tgz',
   '/shared/custom_name/c_database/database_name_56_1118843/file_2.tgz']},
 {'uid': 'local_custom_name_4_c_database_database_name_58',
  'link_list': ['/local/custom_name_4/c_database/database_name_58_1118843/file_1.tgz']},
 {'uid': 'local_custom_name_ac_database_database_name_58',
  'link_list': ['/local/custom_name/ac_database/database_name_58_1118843/file_2.tgz']}]

Upvotes: 1

Related Questions