faranak777
faranak777

Reputation: 469

converting non-numeric to numeric value using Panda libraries

I am learning Pandas and I came to an interesting question. So I have a Dataframe like this:

COL1    COL2      COL3
a     9/8/2016     2
b     12/4/2016    23
         ...
n     1/1/2015     21

COL1 is a String, Col2 is a timestamp and Col3 is a number. Now I need to do some analysis on this Dataframe and I want to convert all the non-numeric data to numeric. I tried using DictVectorizer() to convert COL1 and 2 to numeric but first of all I am not sure if this is the best way doing such a thing and second I don't know what to do with the timestamp. When I use DictVectorizer the output would be like:

{u'COL3: {0:2, 1:23 , ...,n:21}, 'COL1': {0: u'a', 1:'b', ... , n:'n'}, 'COL2': {0: u'9/8/2016' , 1: u'12/4/2016' , ... , n:u'1/1/2016'}}

but from what I learned it should be like this or at least I know I need something like this:

 {COL1:'a', COL2: '9/8/2016' , COL3: 2  and so on}   

so, questions: 1-what is the best way of converting non- numeric (including date) to numeric values to use in sklearn libraries 2- what is the right way of using DictVectorize()

Any help would be appreciated.

Upvotes: 3

Views: 13923

Answers (4)

Ginnu Kaur
Ginnu Kaur

Reputation: 1

Bring all values between [0,1]

scaler = MinMaxScaler()

df_norm = scaler.fit_transform(df_enc)
df_norm = pd.DataFrame(df_norm,columns = df_enc.columns)

Upvotes: -1

piRSquared
piRSquared

Reputation: 294218

for strings
consider

s = pd.Series(list('abcdefagbfhickjjmshh'))
# get unique values
u = s.unique()
# gen a mapping series
m = pd.Series(range(len(u)), u)
# encode
s.map(m)

0      0
1      1
2      2
3      3
4      4
5      5
6      0
7      6
8      1
9      5
10     7
11     8
12     2
13     9
14    10
15    10
16    11
17    12
18     7
19     7
dtype: int64

for dates
there are two ways to do it. If you don't care about the relative position of the dates, use the same exact method as for strings. Otherwise
consider

d = pd.date_range('2015-03-31', periods=20).to_series().reset_index(drop=True)

pd.Series(d.values.astype(np.timedelta64), d.index, np.int64)

0     1427760000000000000
1     1427846400000000000
2     1427932800000000000
3     1428019200000000000
4     1428105600000000000
5     1428192000000000000
6     1428278400000000000
7     1428364800000000000
8     1428451200000000000
9     1428537600000000000
10    1428624000000000000
11    1428710400000000000
12    1428796800000000000
13    1428883200000000000
14    1428969600000000000
15    1429056000000000000
16    1429142400000000000
17    1429228800000000000
18    1429315200000000000
19    1429401600000000000
dtype: int64

Upvotes: 0

Tiphaine
Tiphaine

Reputation: 131

To encode non-numeric data to numeric you can use scikit-learn's LabelEncoder. It will encode each category such as COL1's a, b, c to integers.

Assuming df is your dataframe, try:

from sklearn.preprocessing import LabelEncoder
enc = LabelEncoder()
enc.fit(df['COL1'])
df['COL1'] = enc.transform(df['col1'])
  • enc.fit() creates the corresponding integer values.
  • enc.transform() applies the encoding to the df values.

For the second column, using Pandas to_datetime() function should do the trick, like @quinn-weber mentioned, try:

df['COL2'] = pd.to_datetime(df['COL2'])

Upvotes: 6

Quinn Weber
Quinn Weber

Reputation: 927

You could convert COL1 with something like this:

import pandas as pd
import string
table = pd.DataFrame([
    ['a','9/8/2016',2],
    ['b','12/4/2016',23],
    ['n','1/1/2015',21],
], columns=['COL1', 'COL2', 'COL3'])
table['COL1'] = table['COL1'].map(dict(zip(list(string.lowercase), xrange(0,25))))

As for the timestamp, you could do:

table['COL2'] = pd.to_datetime(
    table['COL2'], format='%m/%d/%Y'
).dt.strftime(date_format='%Y%m%d')

Upvotes: 1

Related Questions