TheM00s3

Reputation: 3721

Map pandas dataframe column to a matrix

The following operation

import time
import pandas as pd
import numpy as np
data = pd.read_csv(fname,sep=",",quotechar='"')

will create a 650,000 x 9 dataframe. The first column contains dates, and the following function is designed to split a single date stamp into 5 separate features.

def timepartition(elm):
    tm = time.strptime(elm,"%Y-%m-%d %H:%M:%S")
    return tm[0], tm[1], tm[2], tm[3], tm[4]

data["Dates"].map(timepartition)

What I would like is to assign those 5 values to a 650,000x7 np matrix.

xtrn = np.zeros(shape=(data.shape[0],7))
xtrn[:,0:4] = np.asarray(data["Dates"].map(timepartition)) 
#above returns error ValueError: could not broadcast input array from shape (650000) into shape (650000,4)

Upvotes: 0

Views: 1063

Answers (3)

ontologist

Reputation: 605

You might try using some of the built-in pandas features.

dates = pd.to_datetime(data['Dates'])
date_df = pd.DataFrame(dict(
    year=dates.dt.year,
    month=dates.dt.month,
    day=dates.dt.day,
    # etc.
))
xtrn[:, :5] = date_df.values  # use date_df[['year', 'month', 'day', etc.]] if the order comes out wrong
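
For reference, here is a minimal end-to-end sketch of this approach. It assumes the five features are year, month, day, hour and minute (matching the question's timepartition), and it uses a hypothetical three-row frame in place of the real 650,000-row one.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row stand-in for the 650,000-row frame in the question.
data = pd.DataFrame({"Dates": ["2015-05-13 23:53:00",
                               "2014-01-02 04:05:06",
                               "2003-12-31 00:00:59"]})

# Parse the strings once into a datetime64 Series.
dates = pd.to_datetime(data["Dates"], format="%Y-%m-%d %H:%M:%S")

# Assumed feature set: the same five fields timepartition returns.
date_df = pd.DataFrame({
    "year": dates.dt.year,
    "month": dates.dt.month,
    "day": dates.dt.day,
    "hour": dates.dt.hour,
    "minute": dates.dt.minute,
})

xtrn = np.zeros(shape=(data.shape[0], 7))
# date_df has exactly five columns, so the shapes on both sides agree.
xtrn[:, :5] = date_df.to_numpy()
```

The remaining two columns of xtrn stay at zero until other features are filled in.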

Upvotes: 1

Bill Harper

Reputation: 378

The following worked for me. I'm not sure which method is faster, but it was easier for me to understand logically what's going on. Here my dataset "crimes" is your "data" and our time formats are a bit different.

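def timepartition(elm):
    # %p (AM/PM) only affects the parsed hour when paired with %I (12-hour clock), not %H
    tm = time.strptime(elm,"%m/%d/%Y %I:%M:%S %p")
    return tm[0:5]

# np.int was removed in NumPy 1.24; the builtin int is equivalent here
zeros = np.zeros(shape=(crimes.shape[0],3), dtype=int)
# iterate over the values directly so this works whatever the index labels are
dates = np.array([timepartition(d) for d in crimes["Date"]])
new = np.hstack((dates,zeros))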

Upvotes: 0

TheM00s3

Reputation: 3721

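Calling map on a dataframe column returns a new Series, and because timepartition returns tuples, that Series has dtype object, with one whole tuple per element. It is therefore one-dimensional, with shape (650000,), which is why it cannot be broadcast into a two-dimensional slice of xtrn.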
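
A small sketch of that behaviour, using a hypothetical two-row frame in place of the real data. It also shows one possible way (not necessarily the fastest) to get a two-dimensional array out of the mapped Series, by converting it to a list of tuples first.

```python
import time

import numpy as np
import pandas as pd

def timepartition(elm):
    tm = time.strptime(elm, "%Y-%m-%d %H:%M:%S")
    return tm[0], tm[1], tm[2], tm[3], tm[4]

# Hypothetical two-row stand-in for the question's dataframe.
data = pd.DataFrame({"Dates": ["2015-05-13 23:53:00", "2014-01-02 04:05:06"]})

mapped = data["Dates"].map(timepartition)

# Converting the Series directly keeps one tuple per element: a 1-D object array.
flat = np.asarray(mapped)

# Converting via a list of tuples lets NumPy infer a 2-D integer array.
stacked = np.array(mapped.tolist())

xtrn = np.zeros(shape=(data.shape[0], 7))
xtrn[:, 0:5] = stacked  # five values into five columns
```

Note that the slice is 0:5 rather than 0:4, since timepartition returns five values.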

Another approach is to make the following change to timepartition:

def timepartition(elm):
    tm = time.strptime(elm,"%Y-%m-%d %H:%M:%S")
    return [tm[i] for i in range(5)]

This now returns a list instead of a tuple. The following code creates an array with the desired dimensions from the dataframe column and assigns it to xtrn.

xtrn[:,0:5] = np.array(list(map(timepartition, data["Dates"].tolist())))

np.array infers a two-dimensional array from the nested lists produced by applying the partitioning function to a plain list representation of the column, which is flat in this case. In Python 3, map returns an iterator, so it is wrapped in list() before being handed to NumPy.

Upvotes: 0
