Reputation: 3094

Data structures with Python

Python has a lot of convenient data structures (lists, tuples, dicts, sets, etc) which can be used to make other 'conventional' data structures (Eg, I can use a Python list to create a stack and a collections.dequeue to make a queue, dicts to make trees and graphs, etc).

There are even third-party data structures that can be used for specific tasks (for instance the structures in Pandas, pytables, etc).

So, if I know how to use lists, dicts, sets, etc, should I be able to implement any arbitrary data structure if I know what it is supposed to accomplish?

In other words, what kind of data structures can the Python data structures not be used for?

Thanks

Upvotes: 2

Answers (4)

Ray Lutz

Reputation: 79

It is important to realize that Python can represent hierarchical structures which are combinations of list (or tuple) and dict. For example, list-of-dict or dict-of-list or dict-of-dict are common simple structures. These are so common, that in my own practice, I append the data type to the variable name, like 'parameters_lod'. But these can go on to arbitrary depth.

The _lod datatype can be easily converted into a pandas DataFrame, for example, or any database table. In fact, some realizations of big-data tables use the _lod structure, sometimes omitting the commas between each dict and omitting the surrounding list brackets []. This makes it easy to append to a file of such lines. AWS offers tables that are dict syntax.

A _lod can be easily converted to a _dod if there is a field that is unique and can be used to index the table. An important difference between _lod and _dod is that the _lod can have multiple entries for the same keyfield, whereas a dict is required to have only one. Thus, it is more general to start with the _lod as the primary basic table structure so duplicates are allowed until the table is inspected to combine those entries.

If the lod is turned into dod, it is preferred to keep the entire dict intact, and not remove the item that is used for the keyfield.

a_lod = [
    {'k': 'sam', 'a': 1, 'b': 2, 'c': 3},
    {'k': 'sue', 'a': 4, 'b': 5, 'c': 6},
    {'k': 'joe', 'a': 7, 'b': 8, 'c': 9}
    ]

a_dod = {'sam': {'k': 'sam', 'a': 1, 'b': 2, 'c': 3},
         'sue': {'k': 'sue', 'a': 4, 'b': 5, 'c': 6},
         'joe': {'k': 'joe', 'a': 7, 'b': 8, 'c': 9}
        }

Thus, the dict key is added but the records are unchanged. We find this is a good practice so the underlying dicts are unchanged.

Pandas DataFrame.append() function is very slow. Therefore, you should not construct a dataframe one record at a time using this syntax:

df = df.append(record)

Instead, build it as a lod and then convert to a df, as follows.

df = pd.DataFrame.from_dict(lod)

This is much faster, as the other method will get slower and slower as the df grows, because the whole df is copied each time.

It has become important in our development to emphasize the use of _lod and avoid field names in each record that are not consistent, so they can be easily converted to Dataframe. So we avoid using key fields in dicts like 'sam':(data) and use {'name':'sam', 'dataname': (arbitrary data)} instead.

The most elegant thing about python structures is the fact that the default is to work with references to the data rather than values. This must be understood because modifying data in a reference will modify the larger structure.

If you want to make a copy, then you need to use .copy() and sometimes .deepcopy or .copy(deep=True) when using Pandas. Then the data structure will be copied, otherwise, a variable name is just a reference.

Further, we discourage using the dol structure, and instead prefer the lodol of dodol. This is because it is best to have each data item identified with a label, which also allows additional fields to be added.

Upvotes: 0

Srikar Appalaraju

Reputation: 73718

For some simple data structures (eg. a stack), you can just use the builtin list to get your job done. With more complex structures (eg. a bloom filter), you'll have to implement them yourself using the primitives the language supports.

You should use the builtins if they serve your purpose really since they're debugged and optimised by a horde of people for a long time. Doing it from scratch by yourself will probably produce an inferior data structure. Whether you're using Python, C++, C#, Java, whatever, you should always look to the built in data structures first. They will generally be implemented using the same system primitives you would have to use doing it yourself, but with the advantage of having been tried and tested.

Combinations of these data structures (and maybe some of the functions from helper modules such as heapq and bisect) are generally sufficient to implement most richer structures that may be needed in real-life programming; however, that's not invariably the case.

Only when the provided data structures do not allow you to accomplish what you need, and there isn't an alternative and reliable library available to you, should you be looking at building something from scratch (or extending what's provided).

Lets say that you need something more than the rich python library provides, consider the fact that an object's attributes (and items in collections) are essentially "pointers" to other objects (without pointer arithmetic), i.e., "reseatable references", in Python just like in Java. In Python, you normally use a None value in an attribute or item to represent what NULL would mean in C++ or null would mean in Java.

So, for example, you could implement binary trees via, e.g.:

class Node(object):

  __slots__ = 'data', 'left', 'right'

  def __init__(self, data=None, left=None, right=None):
    self.data  = data
    self.left  = left
    self.right = right

plus methods or functions for traversal and similar operations (the __slots__ class attribute is optional -- mostly a memory optimization, to avoid each Node instance carrying its own __dict__, which would be substantially larger than the three needed attributes/references).

Other examples of data structures that may best be represented by dedicated Python classes, rather than by direct composition of other existing Python structures, include tries (see e.g. here) and graphs (see e.g. here).

Upvotes: 4

Adam Mihalcin

Reputation: 14478

You can use the Python data structures to do anything you like. The entire programming language Lisp (now people use either Common Lisp or Scheme) is built around the linked list data structure, and Lisp programmers can build any data structure they choose.

That said, there are sometimes data structures for which the Python data structures are not the best option. For instance, if you want to build a splay tree, you should either roll your own or use an open-source project like pysplay. If the built-in data structures, solve your problem, use them. Otherwise, look beyond the built-in data structures. As always, use the best tool for the job.

Upvotes: 1

Amber

Reputation: 527558

Given that all data structures exist in memory, and memory is effectively just a list (array)... there is no data structure that couldn't be expressed in terms of the basic Python data structures (with appropriate code to interact with them).

Upvotes: 0

Data structures with Python

Answers (4)

Related Questions