alemosie
alemosie

Reputation: 335

How can I sort datetime columns by row value in a Pandas dataframe?

I'm new to Python and Pandas, and I've pulled in a database table that contains 15+ different datetime columns. My task is to sort these columns generally by earliest to latest value in the rows. However, the data is not clean; sometimes, where Column A's date would come before Column B's date in Row 0, A would come after B in Row 1.

I wrote a few functions (redacted here for simplicity) that compare two columns by calculating the percentage of times dates in A come before and after B, and then sorting the columns based on that percentage:

def get_percentage(df, df_subset):
    return len(df_subset)/float(len(df))    

def duration_report(df, earlier_column, later_column):   
    results = {}
    td = df[later_column] - df[earlier_column]
    results["Before"] = get_percentage(df, df.loc[td >= pd.Timedelta(0)])
    results["After"] = get_percentage(df, df.loc[td <= pd.Timedelta(0)])
    ind = "%s vs %s" % (earlier_column, later_column)
    return pd.DataFrame(data=results, index=[ind])

def order_date_columns(df, col1, col2):
    before = duration_report(df, col1, col2).Before.values[0]
    after = duration_report(df, col1, col2).After.values[0]
    if before >= after:
        return [col1, col2]
    else:
        return [col2, col1]

My goal with the above code is to programmatically implement the following:

If Col A dates come before Col B dates 50+% of the time, Col A should come before Col B in the list of earliest to latest datetime columns.

The order_date_columns() function successfully sorts two columns into the correct order, but how do I apply this sorting to the 15+ columns at once? I've looked into df.apply(), lambda, and map(), but haven't been able to crack this problem.

Any help (with code clarity/efficiency, too) would be appreciated!

Upvotes: 2

Views: 1039

Answers (2)

mfsiega
mfsiega

Reputation: 2872

Since you're using Python 2.7, you can use the cmp keyword argument to sorted. To get the column names in the order that you're looking for, I would do something like:

# Returns -1 if first_column[i] > second_column[i] more often.
# Returns 1 if vice versa.
# Returns 0 if equal.
# Assumes df[first_column] and df[second_column] are the same length.
def compare_two(first_column, second_column):
    c1_greater_count = 0
    c2_greater_count = 0
    # Iterate over the two columns in the dataframe. df must be in accessible scope.
    for i in range(len(df[first_column])):
        if df[first_column].iloc(i) > df[second_column].iloc[i]:
            c1_greater_count += 1
        elif df[second_column].iloc[i] > df[first_column].iloc[i]:
            c2_greater_count += 1

    if c1_greater_count > c2_greater_count:
        return -1
    if c2_greater_count > c1_greater_count:
        return 1
    return 0

df = get_dataframe_from_somewhere()
relevant_column_names = get_relevant_column_names(df) # e.g., get all the dates.
sorted_column_names = sorted(relevant_column_names, cmp=compare_two)

# sorted_column_names holds the names of the relevant columns,
# sorted according to the given ordering.

I'm sure there's a more Pythonic way to do it, but this should work. Note that for Python 3, you can use the cmp_to_key utility.

Upvotes: 1

Sohier Dane
Sohier Dane

Reputation: 142

If you don't mind taking a bit of a shortcut and using the median of each date column, this should work:

def order_date_columns(df, date_columns_to_sort):
    x = [(col, df[col].astype(np.int64).median()) for col in date_columns_to_sort]
    return [x[0] for x in sorted(x, key=lambda x: x[1])]

Upvotes: 2

Related Questions