Sergey Shcherbakov

Reputation: 4778

TensorFlow dataset with multi-dimensional Tensors from a CSV file

Is there a way to load a TensorFlow dataset with a multi-dimensional feature tensor from a CSV file (or another input format), and if so, how?

For example, my CSV input looks like the following:

f1,  f2,  f3,                      label
0.1, 0.2, 0.1;0.2;0.3;1.1;1.2;1.3, 1
0.2, 0.3, 0.2;0.3;0.4;1.2;1.3;1.4, 0
0.3, 0.4, 0.3;0.4;0.5;1.3;1.4;1.5, 1

I'd like to load a dataset from such a file, e.g.

import tensorflow as tf

frames_csv_ds = tf.data.experimental.make_csv_dataset(
    'input.csv',
    header=False,
    column_names=['f1','f2','f3','label'],
    batch_size=5,
    label_name='label',
    num_epochs=1,
    ignore_errors=True,)

for batch, label in frames_csv_ds.take(1):
  for key, value in batch.items():
    print(f"{key:20s}: {value}")
  print()
  print(f"{'label':20s}: {label}")

To get the batch as:

f1 : [0.1   0.2   0.3  ]
f2 : [0.2   0.3   0.4  ]
f3 : [ [[0.1, 0.2, 0.3], [1.1, 1.2, 1.3]], [[0.2, 0.3, 0.4], [1.2, 1.3, 1.4]], [[0.3, 0.4, 0.5], [1.3, 1.4, 1.5]] ]
label : [1, 0, 1]

The snippet above is incomplete and doesn't work. Is there a way to get the dataset in the illustrated form? If so, can this be done for arrays whose dimensions vary across the dataset?

Upvotes: 1

Views: 380

Answers (2)

Luis l

Reputation: 123

The function get_window_data(df: pd.DataFrame, target: pd.Series, window_size: int, todict=False) receives a DataFrame with N columns and a pd.Series holding the ground-truth values.

Example: a DataFrame loaded from CSV (2D array) of 15596 rows by 76 columns, with a ground truth of shape (15596, 3):

 LOG Shapes: X_train: (15596, 76) y_train: (15596, 3) index_train: (15596,)

It returns a multidimensional 3D array (with time windows): for each reference ground-truth row, the N previous time rows. In this example, each time window is composed of the previous 48 rows.

Example: a 3D array of 15596 rows, in time windows of 48, by 76 columns, with a ground truth of shape (15596, 3):

 LOG Shapes: X_train: (15596, 48, 76) y_train: (15596, 3) index_train: (15596,)

import numpy as np
import pandas as pd

def get_window_data(df: pd.DataFrame, target: pd.Series, window_size: int, todict=False):
    x = []
    y = []
    index = []

    if target is None:
        df.insert(loc=0, column="target", value=0)
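        # Assign through df.iloc directly; chained assignment (df["target"].iloc[1] = 1) may not modify df
        df.iloc[1, df.columns.get_loc("target")] = 1
        df.iloc[2, df.columns.get_loc("target")] = 2
        target = pd.get_dummies(df['target'])
        df = df.drop(['target'], axis=1)
        # Why is this done in the lines above?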
        print("INFO To create the window_data need to create an artificial Useless Y_target  Count: ", len(df.columns), " Names : ", ",".join(df.columns))
        print("DEBUG features_W3 index Dates:: ", df.index[0], df.index[-1], " Shape: ", df.shape)
        target = target[target.index.isin(df.index)]
        df = df[df.index.isin(target.index)]

    assert len(df) == len(target)

    for i in range(window_size, len(df)+1):
        # Target is at the same row to the feature
        assert df[i - window_size: i].index[-1] == target.index[i-1]
        # Builds the 3D array (with time windows): for each reference GT row, the N previous time rows
        x.append(df[i - window_size: i].values)
        if isinstance(target, pd.DataFrame): y.append(target.iloc[i-1].values)
        else: y.append(target.iloc[i-1])
        index.append(target.index[i-1])

    x = np.array(x, np.float32)
    try:
        index = np.array(index, np.datetime64)
    except Exception as ex:
        print("Exception: ", ex)
        index = np.array(index)
    y = np.array(y)

    if todict: return { 'X': x, 'y': y, 'index': index }
    else: return x, y, index

Example of how to call the method: https://github.com/Leci37/stocks-prediction-Machine-learning-RealTime-telegram/blob/master/Tutorial/RUN_buy_sell_Tutorial_3W_5min_RT.py
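
To make the shape change concrete, here is a minimal sketch of the same windowing idea on a toy array. It uses NumPy's sliding_window_view rather than the loop in get_window_data, so it is a swapped-in technique, not the function above; the toy sizes and variable names are assumptions chosen for illustration. As in the loop, n_rows input rows yield n_rows - window_size + 1 windows.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

# Toy sizes, chosen only for illustration.
n_rows, n_features, window_size = 5, 2, 3
features = np.arange(n_rows * n_features, dtype=np.float32).reshape(n_rows, n_features)

# sliding_window_view appends the window axis last:
# (n_windows, n_features, window_size).
windows = sliding_window_view(features, window_shape=window_size, axis=0)

# Move the window axis to the middle to get (n_windows, window_size, n_features).
windows = windows.transpose(0, 2, 1)

print(windows.shape)  # (3, 3, 2)
```

Each windows[i] equals features[i : i + window_size], which is the slice the loop appends to x.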

Upvotes: 1

Mohammad Ahmed

Reputation: 1634

You can do this by parsing each line yourself with a custom function built from TensorFlow ops:

import tensorflow as tf

file_path = "data.csv"
dataset = tf.data.TextLineDataset(file_path).skip(1)

def parse_csv_line(line):
  # Split the line into a list of strings
  fields = tf.io.decode_csv(line, record_defaults=[[""]] * 4)
  
  f1 = tf.strings.to_number(fields[0], tf.float32)
  f2 = tf.strings.to_number(fields[1], tf.float32)
  f3 = tf.strings.to_number(tf.strings.split(fields[2], ";"), tf.float32)
  label = tf.strings.to_number(fields[3], tf.int32)
  
  return {"f1": f1, "f2": f2, "f3": f3, "label": label}

dataset = dataset.map(parse_csv_line).batch(5)
next(iter(dataset.take(1)))

Output:

{'f1': <tf.Tensor: shape=(3,), dtype=float32, numpy=array([0.1, 0.2, 0.3], dtype=float32)>,
 'f2': <tf.Tensor: shape=(3,), dtype=float32, numpy=array([0.2, 0.3, 0.4], dtype=float32)>,
 'f3': <tf.Tensor: shape=(3, 6), dtype=float32, numpy=
 array([[0.1, 0.2, 0.3, 1.1, 1.2, 1.3],
        [0.2, 0.3, 0.4, 1.2, 1.3, 1.4],
        [0.3, 0.4, 0.5, 1.3, 1.4, 1.5]], dtype=float32)>,
 'label': <tf.Tensor: shape=(3,), dtype=int32, numpy=array([1, 0, 1], dtype=int32)>}
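
The output above has f3 flattened to shape (3, 6), while the question asks for nested rows of shape (3, 2, 3) and for arrays whose size may vary. One possible way to get there, as a sketch under stated assumptions: reshape f3 inside the parse function and batch with padded_batch. The inner row length INNER = 3, the temporary file, and the (features, label) return structure are assumptions made for this example; tf.strings.strip is added only to guard against the spaces after the commas.

```python
import os
import tempfile

import tensorflow as tf

# Hypothetical sample file reproducing the rows from the question.
csv_text = (
    "f1,f2,f3,label\n"
    "0.1, 0.2, 0.1;0.2;0.3;1.1;1.2;1.3, 1\n"
    "0.2, 0.3, 0.2;0.3;0.4;1.2;1.3;1.4, 0\n"
    "0.3, 0.4, 0.3;0.4;0.5;1.3;1.4;1.5, 1\n"
)
file_path = os.path.join(tempfile.mkdtemp(), "input.csv")
with open(file_path, "w") as f:
    f.write(csv_text)

INNER = 3  # assumed length of each inner row of f3


def parse_csv_line(line):
    # Read all four columns as strings and strip the spaces after the commas.
    fields = tf.io.decode_csv(line, record_defaults=[[""]] * 4)
    fields = [tf.strings.strip(field) for field in fields]

    f1 = tf.strings.to_number(fields[0], tf.float32)
    f2 = tf.strings.to_number(fields[1], tf.float32)
    # Split the ";"-separated cell, then fold it into rows of INNER values.
    flat = tf.strings.to_number(tf.strings.split(fields[2], ";"), tf.float32)
    f3 = tf.reshape(flat, [-1, INNER])
    label = tf.strings.to_number(fields[3], tf.int32)
    return {"f1": f1, "f2": f2, "f3": f3}, label


dataset = (
    tf.data.TextLineDataset(file_path)
    .skip(1)  # skip the header row
    .map(parse_csv_line)
    .padded_batch(5)  # pads f3 to the longest example in each batch
)

features, label = next(iter(dataset))
print(features["f3"].shape)  # (3, 2, 3)
```

With padded_batch, rows whose f3 has a different number of inner rows are zero-padded to the largest one in the batch, so the number of values in each cell only has to be a multiple of INNER. If padding is not acceptable, a ragged representation would be the alternative to look at.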

Upvotes: 1
