Reputation: 469
I am learning Pandas and I came to an interesting question. So I have a Dataframe like this:
COL1 COL2 COL3
a 9/8/2016 2
b 12/4/2016 23
...
n 1/1/2015 21
COL1 is a String, Col2 is a timestamp and Col3 is a number. Now I need to do some analysis on this Dataframe and I want to convert all the non-numeric data to numeric. I tried using DictVectorizer() to convert COL1 and 2 to numeric but first of all I am not sure if this is the best way doing such a thing and second I don't know what to do with the timestamp. When I use DictVectorizer the output would be like:
{u'COL3: {0:2, 1:23 , ...,n:21}, 'COL1': {0: u'a', 1:'b', ... , n:'n'}, 'COL2': {0: u'9/8/2016' , 1: u'12/4/2016' , ... , n:u'1/1/2016'}}
but from what I learned it should be like this or at least I know I need something like this:
{COL1:'a', COL2: '9/8/2016' , COL3: 2 and so on}
so, questions: 1-what is the best way of converting non- numeric (including date) to numeric values to use in sklearn libraries 2- what is the right way of using DictVectorize()
Any help would be appreciated.
Upvotes: 3
Views: 13923
Reputation: 1
Bring all values between [0,1]
scaler = MinMaxScaler()
df_norm = scaler.fit_transform(df_enc)
df_norm = pd.DataFrame(df_norm,columns = df_enc.columns)
Upvotes: -1
Reputation: 294218
for strings
consider
s = pd.Series(list('abcdefagbfhickjjmshh'))
# get unique values
u = s.unique()
# gen a mapping series
m = pd.Series(range(len(u)), u)
# encode
s.map(m)
0 0
1 1
2 2
3 3
4 4
5 5
6 0
7 6
8 1
9 5
10 7
11 8
12 2
13 9
14 10
15 10
16 11
17 12
18 7
19 7
dtype: int64
for dates
there are two ways to do it. If you don't care about the relative position of the dates, use the same exact method as for strings. Otherwise
consider
d = pd.date_range('2015-03-31', periods=20).to_series().reset_index(drop=True)
pd.Series(d.values.astype(np.timedelta64), d.index, np.int64)
0 1427760000000000000
1 1427846400000000000
2 1427932800000000000
3 1428019200000000000
4 1428105600000000000
5 1428192000000000000
6 1428278400000000000
7 1428364800000000000
8 1428451200000000000
9 1428537600000000000
10 1428624000000000000
11 1428710400000000000
12 1428796800000000000
13 1428883200000000000
14 1428969600000000000
15 1429056000000000000
16 1429142400000000000
17 1429228800000000000
18 1429315200000000000
19 1429401600000000000
dtype: int64
Upvotes: 0
Reputation: 131
To encode non-numeric data to numeric you can use scikit-learn's LabelEncoder. It will encode each category such as COL1's a
, b
, c
to integers.
Assuming df is your dataframe, try:
from sklearn.preprocessing import LabelEncoder
enc = LabelEncoder()
enc.fit(df['COL1'])
df['COL1'] = enc.transform(df['col1'])
enc.fit()
creates the corresponding integer values.enc.transform()
applies the encoding to the df values.For the second column, using Pandas to_datetime() function should do the trick, like @quinn-weber mentioned, try:
df['COL2'] = pd.to_datetime(df['COL2'])
Upvotes: 6
Reputation: 927
You could convert COL1 with something like this:
import pandas as pd
import string
table = pd.DataFrame([
['a','9/8/2016',2],
['b','12/4/2016',23],
['n','1/1/2015',21],
], columns=['COL1', 'COL2', 'COL3'])
table['COL1'] = table['COL1'].map(dict(zip(list(string.lowercase), xrange(0,25))))
As for the timestamp, you could do:
table['COL2'] = pd.to_datetime(
table['COL2'], format='%m/%d/%Y'
).dt.strftime(date_format='%Y%m%d')
Upvotes: 1