Tom Werner

Reputation: 319

Merging two datasets in Python efficiently

What would anyone consider the most efficient way to merge two datasets using Python?

A little background - this code will take 100K+ records in the following format:

{user: aUser, transaction: UsersTransactionNumber}, ...

and using the following data

{transaction: aTransactionNumber, activationNumber: associatedActivationNumber}, ...

to create

{user: aUser, activationNumber: associatedActivationNumber}, ...

N.B. These are not Python dictionaries, just the closest way to portray the record format cleanly.

So in theory, all I am trying to do is create a view of two lists (or tables) joined on a common key. At first this points me towards sets (unions etc.), but before I start learning these in depth, are they the way to go? So far I feel this could be implemented as:

  1. Create a list of dictionaries and iterate over the list, comparing the key each time. However, in the worst case this could take up to len(inputDict)*len(outputDict) comparisons <- Not sure?

  2. Manipulate the data as an in-memory SQLite table? Preferably not, as although there is no strict requirement for Python 2.4, it would make life easier.

  3. Some kind of Set based magic?
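For reference, option 2 could look something like this minimal sketch using the standard-library sqlite3 module (available from Python 2.5 onward; the table and column names here are made up, and `txn` stands in for the transaction number since TRANSACTION is an SQL keyword):

```python
import sqlite3

# In-memory database: nothing touches disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (user TEXT, txn TEXT)")
conn.execute("CREATE TABLE acts (txn TEXT, activation TEXT)")

# Illustrative sample rows in the question's two record formats.
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [("alice", "T1"), ("bob", "T2")])
conn.executemany("INSERT INTO acts VALUES (?, ?)",
                 [("T1", "A100"), ("T2", "A200")])

# Let SQLite do the join on the common key.
rows = conn.execute(
    "SELECT u.user, a.activation "
    "FROM users u JOIN acts a ON u.txn = a.txn "
    "ORDER BY u.user"
).fetchall()
# rows == [('alice', 'A100'), ('bob', 'A200')]
```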

Clarification

The whole purpose of this script is to summarise: the actual data sets come from two different sources. The user and transaction numbers arrive in a CSV output by a performance test that measures email activation code throughput. The second dataset comes from parsing the test mailboxes, which contain the transaction id and activation code. The output is then a CSV that gets pumped back into stage 2 of the performance test, activating user accounts using the activation codes that were paired up.

Apologies if my notation for the records was misleading, I have updated them accordingly.

Thanks for the replies, I am going to give a couple of the ideas a try.

Performance isn't overly paramount for me, I just want to try and get into good habits with my Python Programming.

Upvotes: 1

Views: 6128

Answers (4)

S.Lott

Reputation: 391962

Here's a radical approach.

Don't.

You have two CSV files; one (users) is clearly the driver. Leave this alone. The other -- transaction codes for a user -- can be turned into a simple dictionary.

Don't "combine" or "join" anything except when absolutely necessary. Certainly don't "merge" or "pre-join".

Write your application to simply do lookups in the other collection.

Create a list of dictionaries and iterate over the list comparing the key each time,

Close. It looks like this. Note: No Sort.

import csv

# One pass over the activations file to build the lookup table.
# (Column names here are placeholders.)
with open('activations.csv', 'rb') as act_data:
    rdr = csv.DictReader(act_data)
    activations = dict((row['user'], row) for row in rdr)

# Drive from the users file, doing one O(1) lookup per row.
with open('users.csv', 'rb') as user_data:
    rdr = csv.DictReader(user_data)
    with open('users_2.csv', 'wb') as updated_data:
        wtr = csv.DictWriter(updated_data, ['some', 'list', 'of', 'columns'])
        for user in rdr:
            user['some_field'] = activations[user['user_id_column']]['some_field']
            wtr.writerow(user)

This is fast and simple. Save the dictionaries (use shelve or pickle).

however, worst case scenario this could run up to len(inputDict)*len(outputDict) <- Not sure?

False.

One list is the "driving" list. The other is the lookup list. You'll drive by iterating through users and look up the appropriate transaction values. This is O(n) on the list of users. Each lookup is O(1) because dictionaries are hashes.

Upvotes: 6

kriss

Reputation: 24207

This looks like a typical use for dictionaries with the transaction number as key. But you don't have to create a combined structure; just build the lookup dictionaries and use them as needed.
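A minimal sketch of this lookup-dictionary approach, using made-up records in the question's format:

```python
# Illustrative input records (field names taken from the question).
transactions = [
    {"user": "alice", "transaction": "T1"},
    {"user": "bob", "transaction": "T2"},
]
activations = [
    {"transaction": "T1", "activationNumber": "A100"},
    {"transaction": "T2", "activationNumber": "A200"},
]

# Build the lookup once: transaction number -> activation number.
act_by_txn = dict(
    (row["transaction"], row["activationNumber"]) for row in activations
)

# Join with a single O(1) dictionary lookup per user record.
result = [
    {"user": row["user"], "activationNumber": act_by_txn[row["transaction"]]}
    for row in transactions
]
# result == [{'user': 'alice', 'activationNumber': 'A100'},
#            {'user': 'bob', 'activationNumber': 'A200'}]
```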

Upvotes: 1

Drakosha

Reputation: 12165

I'd create a map myTransactionNumber -> {transaction: myTransactionNumber, activationNumber: myActivationNumber}, then iterate over the {user: myUser, transaction: myTransactionNumber} entries and search the map for the needed myTransactionNumber. The complexity of a search is O(log N) for a tree-based map (O(1) on average for a Python dict), where N is the number of entries. So the overall complexity would be O(M log N), where M is the number of user entries.

Upvotes: 0

Aaron Digulla

Reputation: 328760

Sort the two data sets by transaction number. That way you only ever need to keep one row of each in memory.
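This sort-merge approach could be sketched as below (the data is in-memory lists here for illustration, but the same lockstep loop works when streaming two sorted files row by row; field names follow the question):

```python
# Both inputs sorted by the common key, the transaction number.
users = sorted(
    [{"user": "bob", "transaction": "T2"},
     {"user": "alice", "transaction": "T1"}],
    key=lambda r: r["transaction"],
)
acts = sorted(
    [{"transaction": "T2", "activationNumber": "A200"},
     {"transaction": "T1", "activationNumber": "A100"}],
    key=lambda r: r["transaction"],
)

# Advance two cursors in lockstep; only the current row of each
# input is ever examined.
result = []
i = j = 0
while i < len(users) and j < len(acts):
    u, a = users[i], acts[j]
    if u["transaction"] == a["transaction"]:
        result.append({"user": u["user"],
                       "activationNumber": a["activationNumber"]})
        i += 1
        j += 1
    elif u["transaction"] < a["transaction"]:
        i += 1  # user row has no matching activation
    else:
        j += 1  # activation row has no matching user
# result == [{'user': 'alice', 'activationNumber': 'A100'},
#            {'user': 'bob', 'activationNumber': 'A200'}]
```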

Upvotes: 1
