Reputation: 1111
My current text file that I intend to use for LSTM training in Tensorflow looks like this:
> 0.2, 4.3, 1.2
> 1.1, 2.2, 3.1
> 3.5, 4.1, 1.1, 4300
>
> 1.2, 3.3, 1.2
> 1.5, 2.4, 3.1
> 3.5, 2.1, 1.1, 4400
>
> ...
Each sample consists of 3 sequences of 3-feature vectors with only 1 label. I formatted the text file this way so it stays consistent with LSTM training, which needs the time-steps of the sequences, or in general a 3D tensor (batch, num of time-steps, num of features).
My question: how should I use NumPy or TensorFlow.TextReader to reformat the 3x3 sequence vectors and the singleton labels so they become compatible with TensorFlow?
Edit: I have seen many tutorials on reformatting text or CSV files that have vectors and labels, but unfortunately they cover only 1-to-1 relationships, e.g.
0.2, 4.3, 1.2, Class1
1.1, 2.2, 3.1, Class2
3.5, 4.1, 1.1, Class3
becomes:
[0.2, 4.3, 1.2, Class1], [1.1, 2.2, 3.1, Class2], [3.5, 4.1, 1.1, Class3]
which is clearly readable by NumPy, and vectors for simple feed-forward NN tasks can easily be built from it. But this procedure doesn't actually build an LSTM-friendly CSV.
EDIT: The TensorFlow tutorial on CSV formats covers only 2D arrays as an example. The features = col1, col2, col3 layout doesn't assume that there might be time-steps for each sequence array, hence my question.
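For clarity, the layout I'm after would look something like this (a rough sketch in NumPy; shapes inferred from the two samples above):
import numpy as np

# hypothetical target: 2 samples, 3 time-steps, 3 features, 1 label each
X = np.array([[[0.2, 4.3, 1.2],
               [1.1, 2.2, 3.1],
               [3.5, 4.1, 1.1]],
              [[1.2, 3.3, 1.2],
               [1.5, 2.4, 3.1],
               [3.5, 2.1, 1.1]]])   # shape (2, 3, 3) = (batch, time-steps, features)
y = np.array([4300, 4400])          # shape (2,), one label per sample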
Upvotes: 1
Views: 1432
Reputation: 231325
I'm a little confused as to whether you are more interested in the numpy array(s) structure, or the csv format.
The np.savetxt csv file writer can't readily produce text like:
0.2, 4.3, 1.2
1.1, 2.2, 3.1
3.5, 4.1, 1.1, 4300
1.2, 3.3, 1.2
1.5, 2.4, 3.1
3.5, 2.1, 1.1, 4400
savetxt is not tricky. It opens a file for writing, and then iterates on the input array, writing one row at a time to the file. Effectively:
for row in arr:
    f.write(fmt % tuple(row) + '\n')   # one formatted line per row
where fmt has a % field for each element of the row. In the simple case it constructs the row format as delimiter.join([fmt] * arr.shape[1]), in other words repeating the single-field fmt for the number of columns. Or you can give it a multifield fmt.
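A rough sketch of that idea (not the actual savetxt source, just the effect):
import numpy as np

arr = np.arange(6.).reshape(2, 3)
fmt, delimiter = '%.1f', ', '
row_fmt = delimiter.join([fmt] * arr.shape[1])   # '%.1f, %.1f, %.1f'
with open('out.txt', 'w') as f:
    for row in arr:
        f.write(row_fmt % tuple(row) + '\n')     # one formatted line per row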
So you could use normal line/file writing methods to write a custom display. The simplest is to construct it using the usual print commands, and then redirect those to a file.
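For example, assuming the data already sits in an (n, 3, 3) block array plus an (n,) label array (my names, not yours), print with a file argument can reproduce your block layout:
import numpy as np

blocks = np.array([[[0.2, 4.3, 1.2], [1.1, 2.2, 3.1], [3.5, 4.1, 1.1]],
                   [[1.2, 3.3, 1.2], [1.5, 2.4, 3.1], [3.5, 2.1, 1.1]]])
labels = np.array([4300, 4400])

with open('blocks.txt', 'w') as f:
    for block, lbl in zip(blocks, labels):
        print(', '.join('%.1f' % v for v in block[0]), file=f)
        print(', '.join('%.1f' % v for v in block[1]), file=f)
        print(', '.join('%.1f' % v for v in block[2]), '%d' % lbl, sep=', ', file=f)
        print(file=f)                            # blank line between samples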
But having done that, there's the question of how to read that back into a numpy session. np.genfromtxt can handle missing data, but you still have to include the delimiters. It's also trickier to have it read blocks (3 lines separated by a blank line). It's not impossible, but you have to do some preprocessing.
Of course genfromtxt isn't that tricky either. It reads the file line by line, converts each line into a list of numbers or strings, and collects those lists in a master list. Only at the end is that list converted into an array.
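One possible preprocessing step (a sketch, assuming the blank-line-separated layout from the question and a hypothetical file name data.txt) is to collapse each block into a single flat line, label first:
def flatten_blocks(fname):
    # turn each 3-line block into 'label, v1, ..., v9'
    with open(fname) as f:
        text = f.read()
    for chunk in text.strip().split('\n\n'):              # blocks separated by a blank line
        rows = [r.split(',') for r in chunk.strip().splitlines()]
        label = rows[-1][-1].strip()                      # the extra field on the last row
        vals = [v.strip() for r in rows for v in r][:-1]  # the 9 block values
        yield ', '.join([label] + vals)

lines = list(flatten_blocks('data.txt'))
# lines[0] == '4300, 0.2, 4.3, 1.2, 1.1, 2.2, 3.1, 3.5, 4.1, 1.1'
Those lines can then be fed straight to genfromtxt with the structured dtype built below (older numpy versions may want bytes rather than str lines).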
I can construct an array like your text with:
In [121]: dt = np.dtype([('lbl',int), ('block', float, (3,3))])
In [122]: A = np.zeros((2,),dtype=dt)
In [123]: A
Out[123]:
array([(0, [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]),
(0, [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]])],
dtype=[('lbl', '<i4'), ('block', '<f8', (3, 3))])
In [124]: A['lbl']=[4300,4400]
In [125]: A[0]['block']=np.array([[.2,4.3,1.2],[1.1,2.2,3.1],[3.5,4.1,1.1]])
In [126]: A
Out[126]:
array([(4300, [[0.2, 4.3, 1.2], [1.1, 2.2, 3.1], [3.5, 4.1, 1.1]]),
(4400, [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]])],
dtype=[('lbl', '<i4'), ('block', '<f8', (3, 3))])
In [127]: A['block']
Out[127]:
array([[[ 0.2, 4.3, 1.2],
[ 1.1, 2.2, 3.1],
[ 3.5, 4.1, 1.1]],
[[ 0. , 0. , 0. ],
[ 0. , 0. , 0. ],
[ 0. , 0. , 0. ]]])
I can load it from a txt that has all the block values flattened:
In [130]: txt=b"""4300, 0.2, 4.3, 1.2, 1.1, 2.2, 3.1, 3.5, 4.1, 1.1"""
In [131]: txt
Out[131]: b'4300, 0.2, 4.3, 1.2, 1.1, 2.2, 3.1, 3.5, 4.1, 1.1'
genfromtxt can handle a complex dtype, allocating values in order from the flat line list:
In [133]: data=np.genfromtxt([txt],delimiter=',',dtype=dt)
In [134]: data['lbl']
Out[134]: array(4300)
In [135]: data['block']
Out[135]:
array([[ 0.2, 4.3, 1.2],
[ 1.1, 2.2, 3.1],
[ 3.5, 4.1, 1.1]])
I'm not sure about writing it. I would have to reshape it into a 10-column or 10-field array if I want to use savetxt.
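One possible route (a sketch, continuing with the A array above): stack the label column next to the flattened blocks, giving a plain 10-column 2D array that savetxt accepts:
flat = np.hstack([A['lbl'][:, None], A['block'].reshape(len(A), -1)])   # shape (2, 10)
np.savetxt('flat.csv', flat, delimiter=', ',
           fmt=['%d'] + ['%.1f'] * 9)   # integer label, then the 9 float block values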
Upvotes: 1
Reputation: 210812
UPDATE: in addition to the previous answer:
df.stack().to_csv('d:/temp/1D.csv', index=False)
1D.csv:
0.2
4.3
1.2
4300.0
1.1
2.2
3.1
4300.0
3.5
4.1
1.1
4300.0
1.2
3.3
1.2
4400.0
1.5
2.4
3.1
4400.0
3.5
2.1
1.1
4400.0
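Reading that flat file back into the 3D tensor plus a label vector could look like this (a sketch; the reshape assumes 3 time-steps per sample and 4 stacked values per time-step, i.e. 3 features plus the repeated label):
import pandas as pd

vals = pd.read_csv('d:/temp/1D.csv', header=None).values.reshape(-1, 3, 4)
X = vals[:, :, :3]   # (2, 3, 3) tensor: (batch, time-steps, features)
y = vals[:, 0, 3]    # (2,) labels: 4300.0, 4400.0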
OLD answer:
Here is a Pandas solution.
Assume we have the following text file:
0.2, 4.3, 1.2
1.1, 2.2, 3.1
3.5, 4.1, 1.1, 4300
1.2, 3.3, 1.2
1.5, 2.4, 3.1
3.5, 2.1, 1.1, 4400
Code:
import pandas as pd
In [95]: fn = r'D:\temp\.data\data.txt'
In [96]: df = pd.read_csv(fn, sep=',', skipinitialspace=True, header=None, names=list('abcd'))
In [97]: df
Out[97]:
a b c d
0 0.2 4.3 1.2 NaN
1 1.1 2.2 3.1 NaN
2 3.5 4.1 1.1 4300.0
3 1.2 3.3 1.2 NaN
4 1.5 2.4 3.1 NaN
5 3.5 2.1 1.1 4400.0
In [98]: df.d = df.d.bfill()
In [99]: df
Out[99]:
a b c d
0 0.2 4.3 1.2 4300.0
1 1.1 2.2 3.1 4300.0
2 3.5 4.1 1.1 4300.0
3 1.2 3.3 1.2 4400.0
4 1.5 2.4 3.1 4400.0
5 3.5 2.1 1.1 4400.0
Now you can save it back to CSV:
df.to_csv('d:/temp/out.csv', index=False, header=None)
d:/temp/out.csv:
0.2,4.3,1.2,4300.0
1.1,2.2,3.1,4300.0
3.5,4.1,1.1,4300.0
1.2,3.3,1.2,4400.0
1.5,2.4,3.1,4400.0
3.5,2.1,1.1,4400.0
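From there, getting the (batch, time-steps, features) tensor the LSTM expects is a single reshape (a sketch, again assuming 3 time-steps per sample):
import pandas as pd

arr = pd.read_csv('d:/temp/out.csv', header=None).values   # shape (6, 4)
X = arr[:, :3].reshape(-1, 3, 3)                           # (2, 3, 3) feature tensor
y = arr[::3, 3]                                            # one label per 3-row sample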
Upvotes: 1