GJ.

Reputation: 5364

Many-to-many data structure in Python

I have a data set of books and authors, with a many-to-many relationship.

There are about 10^6 books and 10^5 authors, with an average of 10 authors per book.

I need to perform a series of operations on the data set, such as counting the number of books by each author or deleting all books by a certain author from the set.

What would be a good data structure that will allow fast handling?

I'm hoping for some ready-made module that can provide methods along the lines of:

obj.books.add(book1)

# linking
obj.books[n].author = author1
obj.authors[m].book = book1

# deleting
obj.remove(author1) # should automatically remove all links to the books by author1, but not the linked books

I should clarify that I prefer not to use a database for this, but to do it all in memory.

Thanks

Upvotes: 10

Views: 7359

Answers (4)

Julienm

Reputation: 198

Hmm, I don't think you need any third-party library or external database if you don't want to persist your data and want a pure Python solution. It would be faster to:

  • Assign a unique ID to all your books and authors (using e.g. a counter)
  • Manage your objects mapped in their respective dictionaries
  • Make the many-to-many relationship by managing two one-to-many associations:
    from typing import Dict, List
    
    # equivalent to tables
    books: Dict[int, str] = {}
    authors: Dict[int, str] = {}
    # equivalent to a many-to-many relationship
    book_to_author_map: Dict[int, List[int]] = {}
    author_to_book_map: Dict[int, List[int]] = {}
    
    # your database objects
    books[0] = 'my first book'
    books[1] = 'my second book'
    books[2] = 'my third book'
    authors[0] = 'my first author'
    authors[1] = 'my second author'
    authors[2] = 'my third author'
    
    book_to_author_map[0] = [0]
    book_to_author_map[1] = [1, 2]
    book_to_author_map[2] = [0, 2]
    
    author_to_book_map[0] = [0, 2]
    author_to_book_map[1] = [1]
    author_to_book_map[2] = [1, 2]
    
    # operations on your "database"
    
    # add a book 3 and associate it to author 0
    books[3] = 'my fourth book'
    book_to_author_map[3] = []
    book_to_author_map[3].append(0)
    author_to_book_map[0].append(3)
    
    # remove book 1 from author 2
    book_to_author_map[1].remove(2)
    author_to_book_map[2].remove(1)
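
To cover the question's delete case (removing an author and all of its links while keeping the books), a small helper along these lines works on the same dictionaries; remove_author is just an illustrative name, not something from a module:

    # remove an author and every link to it, but keep the books themselves
    def remove_author(author_id: int) -> None:
        for book_id in author_to_book_map.pop(author_id, []):
            book_to_author_map[book_id].remove(author_id)
        del authors[author_id]
    
    remove_author(2)  # drops author 2 and its remaining link to book 2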

Upvotes: 1

Adrian Keister

Reputation: 1035

I would just use pandas for it all. It can handle many-to-many relationships just fine, and counts and deletions are quite straightforward. For example:

import pandas as pd

# Set up the dataframe with books and authors.
df = pd.DataFrame(columns=['author', 'book'])
df.loc[0] = ['John Smith', 'Programming in Python']
df.loc[1] = ['John Doe', 'Programming in Python']
df.loc[2] = ['John Smith', 'Programming in Pandas']
df.loc[3] = ['John Doe', 'Programming in Numpy']
df.loc[4] = ['Jane Doe', 'Programming in Numpy']

# Find all books by John Smith
print(list(df['John Smith' == df['author']]['book'].values))
# Result: ['Programming in Python', 'Programming in Pandas']
# Use the len function to count the number of books.

# Find all authors for 'Programming in Numpy'
print(list(df['Programming in Numpy' == df['book']]['author'].values))
# Result: ['John Doe', 'Jane Doe']

# Drop all of John Doe's rows from the dataframe:
df = df.drop(df['John Doe' == df['author']].index)
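
For the per-author counts the question asks about, a groupby on the same dataframe is enough; this is just a sketch, run against df as built above (i.e. before the drop):

# Count the number of books per author.
print(df.groupby('author')['book'].count())
# Before the drop this gives: Jane Doe 1, John Doe 2, John Smith 2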

Upvotes: 2

S.Lott

Reputation: 392010

"I'm hoping for some ready-made module that can provide methods along the lines of: ..."

Since that actually works, what more do you need?

You have a Book and an Author class definition. You also have a Book-Author association for the relationships. The methods required to manage add/change/delete are only a few lines of code.

Create big old dictionaries of Authors, Books and Author-Book association objects.

Use shelve to store it all.

Done.
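
A rough sketch of what those few lines might look like, purely as one reading of this answer (the Library class and its method names are illustrative, not from any module):

class Book:
    def __init__(self, book_id, title):
        self.id = book_id
        self.title = title

class Author:
    def __init__(self, author_id, name):
        self.id = author_id
        self.name = name

class Library:
    """Dictionaries of Books and Authors plus the Book-Author association."""
    def __init__(self):
        self.books = {}        # book id -> Book
        self.authors = {}      # author id -> Author
        self.links = set()     # (book id, author id) pairs

    def add(self, book, author):
        self.books[book.id] = book
        self.authors[author.id] = author
        self.links.add((book.id, author.id))

    def count_books(self, author):
        return sum(1 for b, a in self.links if a == author.id)

    def remove_author(self, author):
        # remove the author and all its links, but keep the books
        self.links = {(b, a) for b, a in self.links if a != author.id}
        del self.authors[author.id]

lib = Library()
book = Book(0, 'Programming in Python')
author = Author(0, 'John Smith')
lib.add(book, author)
print(lib.count_books(author))   # 1
lib.remove_author(author)        # link gone, book still in lib.books

For the sizes in the question, keeping a per-author set of book ids (as in the dictionary answer above) would make counting and deletion faster than scanning the whole association set, and these dictionaries pickle cleanly into a shelve.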

Upvotes: 2

Alex Martelli

Reputation: 882691

sqlite3 (or any other good relational DB, but sqlite comes with Python and is handier for such a reasonably small set of data) seems the right approach for your task. If you'd rather not learn SQL, SQLAlchemy is a popular "wrapper" over relational DBs, so to speak, that allows you to deal with them at any of several different abstraction levels of your choice.

And "doing it all in memory" is no problem at all (it's silly, mind you, since you'll needlessly pay the overhead of reading in all the data from somewhere more persistent on each and every run of your program, while keeping the DB on a disk file would save you that overhead -- but, that's a different issue;-). Just open your sqlite database as ':memory:' and there you are -- a fresh, new relational DB living entirely in memory (for the duration of your process only), no disk involved in the procedure at all. So, why not?-)

Personally, I'd use SQL directly for this task -- it gives me excellent control of exactly what's going on, and easily lets me add or remove indices to tweak performance, etc. You'd use three tables: a Books table (primary key ID, other fields such as Title &c), an Authors table (primary key ID, other fields such as Name &c), and a "many-to-many relationship table", say BookAuthors, with just two fields, BookID and AuthorID, and one record per author-book connection.

The two fields of the BookAuthors table are what's known as "foreign keys", referring respectively to the ID fields of Books and Authors, and you can define them with an ON DELETE CASCADE so that records referring to a book or author that gets deleted are automatically dropped in turn -- an example of the high semantic level at which even "bare" SQL lets you work, which no other existing data structure can come close to matching.
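
A minimal sketch of that schema using sqlite3's ':memory:' mode (table and column names follow the ones suggested above; note that sqlite3 only honours foreign keys, and hence ON DELETE CASCADE, after PRAGMA foreign_keys = ON):

import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('PRAGMA foreign_keys = ON')   # required for ON DELETE CASCADE

conn.executescript("""
    CREATE TABLE Books   (ID INTEGER PRIMARY KEY, Title TEXT);
    CREATE TABLE Authors (ID INTEGER PRIMARY KEY, Name TEXT);
    CREATE TABLE BookAuthors (
        BookID   INTEGER REFERENCES Books(ID)   ON DELETE CASCADE,
        AuthorID INTEGER REFERENCES Authors(ID) ON DELETE CASCADE
    );
    CREATE INDEX idx_bookauthors_author ON BookAuthors(AuthorID);
""")

conn.execute("INSERT INTO Books VALUES (1, 'Programming in Python')")
conn.execute("INSERT INTO Authors VALUES (1, 'John Smith')")
conn.execute("INSERT INTO BookAuthors VALUES (1, 1)")

# count the number of books by each author
for name, n in conn.execute("""
        SELECT a.Name, COUNT(*) FROM Authors a
        JOIN BookAuthors ba ON ba.AuthorID = a.ID
        GROUP BY a.ID"""):
    print(name, n)

# deleting an author cascades to its BookAuthors records,
# while the books themselves stay in the Books table
conn.execute("DELETE FROM Authors WHERE ID = 1")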

Upvotes: 18
