Reputation: 11
I need to convert a large CSV into a one-hot encoded np.ndarray for a Keras model.
For example, the CSV data:
   F1  F2  F3
1. 'M' 'N' 'I'
2. '-' 'M' 'K'
Each column's possible values:
F1: ['-', 'M', 'N']
F2: ['-', 'A', 'B', 'M', 'N']
F3: ['-', 'I', 'J', 'K']
Expected value (one-hot encoded np.array):
   F1       F2             F3
1. 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0
2. 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1
I'm able to read the CSV and parse it row by row, but it's slow and I have a very large file. Is there a way to use pd.DataFrame.apply to convert it to a one-hot encoding?
Upvotes: 0
Views: 1246
Reputation: 11
Dummies... Lol, pandas has a get_dummies function for dummies like me. Here is a video: https://www.youtube.com/watch?v=0s_1IsROgDc
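For completeness, a minimal sketch of what that looks like on the two example rows from the question (column names taken from there):

import pandas as pd

# The two example rows from the question (quotes around values dropped)
df = pd.DataFrame({'F1': ['M', '-'],
                   'F2': ['N', 'M'],
                   'F3': ['I', 'K']})

# One 0/1 column per (feature, value) pair that appears in this frame;
# .to_numpy() gives the np.ndarray for Keras
X = pd.get_dummies(df, columns=['F1', 'F2', 'F3']).to_numpy()

Note the catch: only values actually present in the frame get a column.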
After I implemented get_dummies, my model threw a size error. I found out it's because I use .fit_generator(), load a chunk of the dataframe at a time, and then call get_dummies on it: it returns inconsistently sized output if a batch doesn't contain all possible values.
Solution: from sklearn.preprocessing import OneHotEncoder
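A minimal sketch of that fix (the category lists are copied from the question; the explicit categories= argument, available in scikit-learn 0.20+, is what keeps every batch the same width):

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Every possible value per column, copied from the question
categories = [['-', 'M', 'N'],            # F1
              ['-', 'A', 'B', 'M', 'N'],  # F2
              ['-', 'I', 'J', 'K']]       # F3

enc = OneHotEncoder(categories=categories)
enc.fit([['-', '-', '-']])  # categories are fixed, so any single valid row is enough to fit

# Inside the generator that feeds .fit_generator(), encode each chunk:
chunk = pd.DataFrame({'F1': ['M', '-'], 'F2': ['N', 'M'], 'F3': ['I', 'K']})
X = enc.transform(chunk).toarray()  # always shape (n_rows, 12), even if a batch is missing some values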
Lesson here: if you have a large dataset, there's more work for you.
Upvotes: 1
Reputation:
To test my method I generated a file in the format you specified with 60000000 lines (every possible combination of the above, of which there are 60, repeated 1000000 times each). Because each line can only be one of 60 options, it is much faster to store a count of how often each distinct row appears instead of storing every row (order shouldn't matter): instead of converting 60000000 lines into your one-hot encoding, you only convert 60. Note: the data file ended up being 480 MB. The following code reads the data into a dictionary:
def foo():
    # Count how many times each distinct row appears instead of keeping every row
    data = {}
    with open('data.csv') as f:
        for line in f:
            try:
                data[line] += 1
            except KeyError:
                data[line] = 1
    return data
With from timeit import timeit and print(timeit(foo, number=10)) I achieved a time of 125.45043465401977 seconds.
From that point you can convert each distinct row into its one-hot encoding and add n copies of it for training. This should also make training your model easier, since Keras can train from a Python generator object, which means at no point does all of the data have to be stored in memory, allowing training on datasets larger than RAM.
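A rough sketch of that idea, assuming the {line: count} dict returned by foo() above and comma-separated lines (the exact CSV layout is an assumption); pairing the batches with labels is left out since the question only shows features:

import numpy as np

# Possible values per column, copied from the question
categories = [['-', 'M', 'N'],            # F1
              ['-', 'A', 'B', 'M', 'N'],  # F2
              ['-', 'I', 'J', 'K']]       # F3

def encode_line(line):
    # One-hot encode a single line such as "M,N,I" into a length-12 vector
    parts = []
    for value, cats in zip(line.strip().split(','), categories):
        vec = np.zeros(len(cats), dtype='float32')
        vec[cats.index(value)] = 1.0
        parts.append(vec)
    return np.concatenate(parts)

def batch_generator(counts, batch_size=32):
    # Encode each distinct row once, repeat it by its count, and yield fixed-size batches
    while True:  # Keras generators are expected to loop forever
        batch = []
        for line, n in counts.items():
            encoded = encode_line(line)
            for _ in range(n):
                batch.append(encoded)
                if len(batch) == batch_size:
                    yield np.stack(batch)  # pair with labels here before handing to .fit_generator()
                    batch = []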
Upvotes: 0