Sayed Gouda
Sayed Gouda

Reputation: 615

python pandas date time output dates the same

The purpose of this code is to read a CSV file, the file has five columns ['Release Date', 'Time', 'Actual', 'Forecast', 'Previous'], and the 'Release Date' column has two shapes of dates:

• Sep 09, 2018 (Aug)

• Sep 24, 2018

So I can NOT simply scrape dates because of the mismatched date shape, so I decided to create a new column based on 'Release Date', 'Time' columns then add it to the original data frame.

I tried this code:

import pandas as pd
df = pd.read_csv(r"C:\Users\Sayed\Desktop\script\data.csv")
for date, time in zip(df['Release Date'], df['Time']):
    Date = date[:12] + ' ' + time
    df['Date'] = Date
print(df.head())

But I got this output:

Release Date Time Actual Forecast Previous Date

Oct 15, 2018 (Sep) 21:30 0.5% 0.7% Feb 01, 1996 05:00

Sep 09, 2018 (Aug) 21:30 0.7% 0.5% 0.3% Feb 01, 1996 05:00

Aug 08, 2018 (Jul) 21:30 0.3% 0.2% -0.1% Feb 01, 1996 05:00

Jul 09, 2018 (Jun) 21:30 -0.1% 0.1% -0.2% Feb 01, 1996 05:00

Jun 08, 2018 (May) 21:30 -0.2% -0.1% -0.2% Feb 01, 1996 05:00

Upvotes: 1

Views: 152

Answers (3)

IMCoins
IMCoins

Reputation: 3306

I'm not a fan of the .apply() method in pandas as it is really not efficient.

Here is another solution of mine to help you to deal with your problem efficiently. I also made a benchmark to prove that .apply() is really inefficient. And when it comes to Big Data, you must use it only if necessary.

df['Date'] = df.loc[:, 'Release Date'][:12] + ' ' + df['Time']

This line means : Take all the Index from 0 to 12 (excluded), from all the rows at column 'Release Date', add a space to it, add column 'Time' to it (implicitly meaning all the rows).

import pandas as pd
import timeit
from matplotlib import pyplot as plt

def IMCoins(df):
    df['Date'] = df.loc[:, 'Release Date'][:12] + ' ' + df['Time']

def petezurich(df):
    df['Date'] = df['Release Date'].apply(lambda x: x[:12]) + ' ' + df['Time']

def benchmark(x_ticks, time_arr_1, time_arr_2):
    """ Displays difference between all the time_arr.
    """
    X = range(len(time_arr_1))

    plt.figure()
    plt.plot(X, time_arr_1, marker='o', color='g', label='IMCoins')
    plt.plot(X, time_arr_2, marker='o', color='r', label='petezurich')
    plt.ylabel('Time in seconds')
    plt.xlabel('Number of elements to iterate on')
    plt.xticks( [nb for nb in range(len(x_ticks))], x_ticks, rotation=30)
    plt.legend()
    plt.tight_layout()
    plt.show()

if __name__ == '__main__':
    #   Iterations are the number of tests run by timeit.
    n_iter = 10

    #   Elements modifies the shape of the DataFrame
    n_elements = 10

    #   Number of time n_elements will get multiplied by factor.
    n_increase = 7
    factor = 10

    time_arr_1, time_arr_2, x_ticks = [], [], []
    for idx in range(n_increase):
        #   Preparing data inside the loop because we need to
        #   increase its size.
        data = {
            'Release Date' : ['a' * 20 for _ in range(n_elements)],
            'Time' : ['b' * 10 for _ in range(n_elements)]
        }
        df = pd.DataFrame(data)

        #   We check the both functions are giving the same results.
        assert IMCoins(df) == petezurich(df), 'results are different'

        t1 = timeit.timeit(stmt = 'IMCoins(df)',
                           setup = 'from __main__ import df, IMCoins',
                           number= n_iter)
        time_arr_1.append(t1)

        t2 = timeit.timeit(stmt = 'petezurich(df)',
                           setup = 'from __main__ import df, petezurich',
                           number = n_iter)
        time_arr_2.append(t2)

        #   We want to correctly display the number of elements computer on
        #   some later plots.
        x_ticks.append(n_elements)

        # In order to increase the data...
        n_elements *= factorx

    benchmark(x_ticks, time_arr_1, time_arr_2)

benchmark

Upvotes: 0

petezurich
petezurich

Reputation: 10184

Your loop is wrong and unnecessary.

Try this:

df["Date"] = df["Release Date"].apply(lambda x: x[:12]) + " " + df["Time"]

Upvotes: 1

fabio.avigo
fabio.avigo

Reputation: 308

This line of code:

df['Date'] = Date

Changes EVERY row in the column 'Date' to receive the value of your last Date input, on every iteration of the loop.

Try using it as a lambda function. You'll notice a boost in performance, too:

def GetDate(row):
    return row['Release Date'][:12] + ' ' + row['Time']

df['Date'] = df.apply(lambda x: GetDate(x), axis=1)

Upvotes: 1

Related Questions