Leb_Broth

Reputation: 1111

Tensorflow: Passing CSV with 3D feature array

My current text file that I intend to use for LSTM training in Tensorflow looks like this:

> 0.2, 4.3, 1.2
> 1.1, 2.2, 3.1
> 3.5, 4.1, 1.1, 4300
> 
> 1.2, 3.3, 1.2
> 1.5, 2.4, 3.1
> 3.5, 2.1, 1.1, 4400
> 
> ...

There are 3 sequences of 3-feature vectors, with only 1 label for each sample. I formatted the text file this way so it is consistent with LSTM training, which requires the time-steps of the sequences, or, in general, a 3D tensor (batch, number of time-steps, number of features).

My question: How should I use NumPy or TensorFlow's TextReader to reformat the 3x3 sequence vectors and the singleton labels so they become compatible with TensorFlow?

Edit: I saw many tutorials on reformatting text or CSV files that have vectors and labels, but unfortunately they cover only 1-to-1 relationships, e.g.

0.2, 4.3, 1.2, Class1
1.1, 2.2, 3.1, Class2
3.5, 4.1, 1.1, Class3

becomes:

[0.2, 4.3, 1.2, Class1], [1.1, 2.2, 3.1, Class2], [3.5, 4.1, 1.1, Class3]

which is clearly readable by NumPy, and vectors dedicated to simple feed-forward NN tasks can easily be built from it. But this procedure doesn't actually build an LSTM-friendly CSV.

EDIT: The TensorFlow tutorial on CSV formats covers only 2D arrays as an example. The features = col1, col2, col3 pattern doesn't assume that there might be time-steps for each sequence array, hence my question.

Upvotes: 1

Views: 1432

Answers (2)

hpaulj

Reputation: 231325

I'm a little confused as to whether you are more interested in the numpy array(s) structure or the CSV format.

The np.savetxt csv file writer can't readily produce text like:

0.2, 4.3, 1.2
1.1, 2.2, 3.1
3.5, 4.1, 1.1, 4300

1.2, 3.3, 1.2
1.5, 2.4, 3.1
3.5, 2.1, 1.1, 4400

savetxt is not tricky. It opens a file for writing, and then iterates over the input array, writing it one row at a time to the file. Effectively:

for row in arr:
    f.write(fmt % tuple(row))

where fmt has a % field for each element of the row. In the simple case it constructs fmt = delimiter.join([fmt]*arr.shape[1]), in other words repeating the single field fmt for the number of columns. Or you can give it a multi-field fmt.
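A minimal sketch of that loop (the '%.1f' field and the in-memory list are my own choices here; savetxt's actual default field is '%.18e'):

```python
import numpy as np

# Sketch of what np.savetxt does internally for a 2D array:
# build one format string per row, then write row by row.
arr = np.array([[0.2, 4.3, 1.2],
                [1.1, 2.2, 3.1]])

fmt = ', '.join(['%.1f'] * arr.shape[1])   # '%.1f, %.1f, %.1f'
lines = [fmt % tuple(row) for row in arr]  # one formatted string per row
print('\n'.join(lines))
```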

So you could use normal line/file writing methods to write a custom layout. The simplest approach is to construct it using the usual print commands and then redirect those to a file.

But having done that, there's the question of how to read it back into a numpy session. np.genfromtxt can handle missing data, but you still have to include the delimiters. It's also trickier to have it read blocks (3 lines separated by a blank line). It's not impossible, but you have to do some preprocessing.

Of course genfromtxt isn't that tricky either. It reads the file line by line, converts each line into a list of numbers or strings, and collects those lists in a master list. Only at the end is that list converted into an array.
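A sketch of that preprocessing, assuming the blocks are always 3 lines separated by one blank line (the inline string stands in for reading the actual file):

```python
import numpy as np

# Split the text on blank lines, parse each 3-line block, and peel
# the trailing label off the last row of each block.
text = """0.2, 4.3, 1.2
1.1, 2.2, 3.1
3.5, 4.1, 1.1, 4300

1.2, 3.3, 1.2
1.5, 2.4, 3.1
3.5, 2.1, 1.1, 4400"""

blocks, labels = [], []
for chunk in text.strip().split('\n\n'):
    rows = [[float(v) for v in line.split(',')] for line in chunk.splitlines()]
    labels.append(rows[-1].pop())   # remove the 4th value from the last row
    blocks.append(rows)

X = np.array(blocks)                # shape (2, 3, 3): batch, time-steps, features
y = np.array(labels)                # shape (2,): one label per sample
```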

I can construct an array like your text with:

In [121]: dt = np.dtype([('lbl',int), ('block', float, (3,3))])
In [122]: A = np.zeros((2,),dtype=dt)
In [123]: A
Out[123]: 
array([(0, [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]),
       (0, [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]])], 
      dtype=[('lbl', '<i4'), ('block', '<f8', (3, 3))])
In [124]: A['lbl']=[4300,4400]
In [125]: A[0]['block']=np.array([[.2,4.3,1.2],[1.1,2.2,3.1],[3.5,4.1,1.1]])
In [126]: A
Out[126]: 
array([(4300, [[0.2, 4.3, 1.2], [1.1, 2.2, 3.1], [3.5, 4.1, 1.1]]),
       (4400, [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]])], 
      dtype=[('lbl', '<i4'), ('block', '<f8', (3, 3))])
In [127]: A['block']
Out[127]: 
array([[[ 0.2,  4.3,  1.2],
        [ 1.1,  2.2,  3.1],
        [ 3.5,  4.1,  1.1]],

       [[ 0. ,  0. ,  0. ],
        [ 0. ,  0. ,  0. ],
        [ 0. ,  0. ,  0. ]]])

I can load it from a txt that has all the block values flattened:

In [130]: txt=b"""4300, 0.2, 4.3, 1.2, 1.1, 2.2, 3.1, 3.5, 4.1, 1.1"""
In [131]: txt
Out[131]: b'4300, 0.2, 4.3, 1.2, 1.1, 2.2, 3.1, 3.5, 4.1, 1.1'

genfromtxt can handle a complex dtype, allocating values in order from the flat line list:

In [133]: data=np.genfromtxt([txt],delimiter=',',dtype=dt)
In [134]: data['lbl']
Out[134]: array(4300)
In [135]: data['block']
Out[135]: 
array([[ 0.2,  4.3,  1.2],
       [ 1.1,  2.2,  3.1],
       [ 3.5,  4.1,  1.1]])

I'm not sure about writing it. I'd have to reshape it into a 10-column or field array if I wanted to use savetxt.
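One sketch of that reshape (the '%g' format and the StringIO buffer are my choices, not part of the answer above): put the label in the first column and the flattened 3x3 block after it, giving a plain (2, 10) array that savetxt can handle.

```python
import io
import numpy as np

# Rebuild the structured array from the session above.
dt = np.dtype([('lbl', int), ('block', float, (3, 3))])
A = np.zeros((2,), dtype=dt)
A['lbl'] = [4300, 4400]
A[0]['block'] = [[.2, 4.3, 1.2], [1.1, 2.2, 3.1], [3.5, 4.1, 1.1]]

# Flatten each record into a 10-column row: label first, then the block.
flat = np.column_stack([A['lbl'], A['block'].reshape(len(A), -1)])

buf = io.StringIO()                 # stand-in for a real file
np.savetxt(buf, flat, delimiter=', ', fmt='%g')
```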

Upvotes: 1

MaxU - stand with Ukraine

Reputation: 210812

UPDATE: addition to the previous answer:

df.stack().to_csv('d:/temp/1D.csv', index=False)

1D.csv:

0.2
4.3
1.2
4300.0
1.1
2.2
3.1
4300.0
3.5
4.1
1.1
4300.0
1.2
3.3
1.2
4400.0
1.5
2.4
3.1
4400.0
3.5
2.1
1.1
4400.0
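Reading that 1D file back into the 3D tensor an LSTM expects is then one reshape away. A sketch, assuming the fixed (feature, feature, feature, label) grouping shown above (the inline array stands in for np.loadtxt('1D.csv')):

```python
import numpy as np

# The flat column holds groups of 3 features followed by the label,
# repeated for each time-step.
flat = np.array([0.2, 4.3, 1.2, 4300.0, 1.1, 2.2, 3.1, 4300.0,
                 3.5, 4.1, 1.1, 4300.0, 1.2, 3.3, 1.2, 4400.0,
                 1.5, 2.4, 3.1, 4400.0, 3.5, 2.1, 1.1, 4400.0])

groups = flat.reshape(-1, 3, 4)     # (samples, time-steps, 3 features + label)
X = groups[:, :, :3]                # (2, 3, 3) feature tensor
y = groups[:, 0, 3]                 # (2,) one label per sample
```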

OLD answer:

Here is a Pandas solution.

Assume we have the following text file:

0.2, 4.3, 1.2
1.1, 2.2, 3.1
3.5, 4.1, 1.1, 4300

1.2, 3.3, 1.2
1.5, 2.4, 3.1
3.5, 2.1, 1.1, 4400

Code:

import pandas as pd

In [95]: fn = r'D:\temp\.data\data.txt'

In [96]: df = pd.read_csv(fn, sep=',', skipinitialspace=True, header=None, names=list('abcd'))

In [97]: df
Out[97]:
     a    b    c       d
0  0.2  4.3  1.2     NaN
1  1.1  2.2  3.1     NaN
2  3.5  4.1  1.1  4300.0
3  1.2  3.3  1.2     NaN
4  1.5  2.4  3.1     NaN
5  3.5  2.1  1.1  4400.0

In [98]: df.d = df.d.bfill()

In [99]: df
Out[99]:
     a    b    c       d
0  0.2  4.3  1.2  4300.0
1  1.1  2.2  3.1  4300.0
2  3.5  4.1  1.1  4300.0
3  1.2  3.3  1.2  4400.0
4  1.5  2.4  3.1  4400.0
5  3.5  2.1  1.1  4400.0

Now you can save it back to CSV:

df.to_csv('d:/temp/out.csv', index=False, header=None)

d:/temp/out.csv:

0.2,4.3,1.2,4300.0
1.1,2.2,3.1,4300.0
3.5,4.1,1.1,4300.0
1.2,3.3,1.2,4400.0
1.5,2.4,3.1,4400.0
3.5,2.1,1.1,4400.0
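From that filled-in CSV, recovering the (batch, time-steps, features) tensor is a straightforward reshape. A sketch, assuming 3 time-steps per sample (the inline array stands in for np.loadtxt('d:/temp/out.csv', delimiter=',')):

```python
import numpy as np

# Each row is 3 features plus the (repeated) label for its block.
data = np.array([[0.2, 4.3, 1.2, 4300.0],
                 [1.1, 2.2, 3.1, 4300.0],
                 [3.5, 4.1, 1.1, 4300.0],
                 [1.2, 3.3, 1.2, 4400.0],
                 [1.5, 2.4, 3.1, 4400.0],
                 [3.5, 2.1, 1.1, 4400.0]])

X = data[:, :3].reshape(-1, 3, 3)   # (2, 3, 3): batch, time-steps, features
y = data[::3, 3]                    # one label per 3-row block -> (2,)
```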

Upvotes: 1
