Reputation: 4778
Is there a way to load a TensorFlow dataset with a multi-dimensional feature tensor from a CSV (or other format) file, and if so, what is it?
For example, my CSV input looks like the following:
f1, f2, f3, label
0.1, 0.2, 0.1;0.2;0.3;1.1;1.2;1.3, 1
0.2, 0.3, 0.2;0.3;0.4;1.2;1.3;1.4, 0
0.3, 0.4, 0.3;0.4;0.5;1.3;1.4;1.5, 1
I'd like to load a dataset from such a file, e.g.
import tensorflow as tf

frames_csv_ds = tf.data.experimental.make_csv_dataset(
    'input.csv',
    header=False,
    column_names=['f1', 'f2', 'f3', 'label'],
    batch_size=5,
    label_name='label',
    num_epochs=1,
    ignore_errors=True,
)

for batch, label in frames_csv_ds.take(1):
    for key, value in batch.items():
        print(f"{key:20s}: {value}")
    print()
    print(f"{'label':20s}: {label}")
To get the batch as:
f1 : [0.1 0.2 0.3 ]
f2 : [0.2 0.3 0.4 ]
f3 : [ [[0.1, 0.2, 0.3], [1.1, 1.2, 1.3]], [[0.2, 0.3, 0.4], [1.2, 1.3, 1.4]], [[0.3, 0.4, 0.5], [1.3, 1.4, 1.5]] ]
label : [1, 0, 1]
The snippet above is incomplete and doesn't work. Is there a way to get the dataset in the illustrated form? If so, can this also be done for arrays whose dimensions vary across the dataset?
Upvotes: 1
Views: 380
Reputation: 123
The function get_window_data(df: pd.DataFrame, target: pd.Series, window_size: int, todict=False)
takes a DataFrame with N feature columns and a pd.Series (or DataFrame) holding the ground-truth values.
Example input: a DataFrame loaded from a CSV (a 2D array) of 15596 rows by 76 columns, with a ground truth of shape (15596, 3):
LOG Shapes: X_train: (15596, 76) y_train: (15596, 3) index_train: (15596,)
It returns a 3D array of time windows: each ground-truth row is paired with the window_size feature rows ending at that same row. In this example each window is made of 48 rows.
Example output: a 3D array of 15596 windows of 48 rows by 76 columns, with a ground truth of shape (15596, 3):
LOG Shapes: X_train: (15596, 48, 76) y_train: (15596, 3) index_train: (15596,)
import numpy as np
import pandas as pd


def get_window_data(df: pd.DataFrame, target: pd.Series, window_size: int, todict=False):
    x = []
    y = []
    index = []
    if target is None:
        # No target supplied: build an artificial one-hot target so the rest
        # of the function can run unchanged. Note that this modifies df.
        df.insert(loc=0, column="target", value=0)
        target_col = df.columns.get_loc("target")
        df.iloc[1, target_col] = 1
        df.iloc[2, target_col] = 2
        target = pd.get_dummies(df['target'])
        df = df.drop(['target'], axis=1)
        # Because it is done in the previous step
        print("INFO To create the window_data need to create an artificial Useless Y_target Count: ", len(df.columns), " Names : ", ",".join(df.columns))
    print("DEBUG features_W3 index Dates:: ", df.index[0], df.index[-1], " Shape: ", df.shape)

    # Keep only the rows present in both the features and the target
    target = target[target.index.isin(df.index)]
    df = df[df.index.isin(target.index)]
    assert len(df) == len(target)

    for i in range(window_size, len(df) + 1):
        window = df.iloc[i - window_size: i]
        # The target is on the same row as the last feature row of the window
        assert window.index[-1] == target.index[i - 1]
        # Each target row gets the window_size feature rows ending at that
        # row, so x ends up as a 3D array (windows, window_size, columns)
        x.append(window.values)
        if isinstance(target, pd.DataFrame):
            y.append(target.iloc[i - 1].values)
        else:
            y.append(target.iloc[i - 1])
        index.append(target.index[i - 1])

    x = np.array(x, np.float32)
    try:
        index = np.array(index, np.datetime64)
    except Exception as ex:
        # Fall back to a plain array when the index is not date-like
        print("Exception: ", ex)
        index = np.array(index)
    y = np.array(y)

    if todict:
        return {'X': x, 'y': y, 'index': index}
    else:
        return x, y, index
Example of how to call the method: https://github.com/Leci37/stocks-prediction-Machine-learning-RealTime-telegram/blob/master/Tutorial/RUN_buy_sell_Tutorial_3W_5min_RT.py
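For reference, the same windowing idea (a 2D array of rows turned into a 3D array of overlapping windows) can probably be sketched without an explicit loop using NumPy's sliding_window_view (NumPy 1.20 or later). This is a minimal sketch, not the linked project's code: the helper name window_array and the toy data are made up for illustration, and it handles only the feature array, not the target or index alignment.

```python
import numpy as np


def window_array(values: np.ndarray, window_size: int) -> np.ndarray:
    """Hypothetical helper: (N, F) -> (N - window_size + 1, window_size, F)."""
    # sliding_window_view puts the window axis last: (N - w + 1, F, w)
    windows = np.lib.stride_tricks.sliding_window_view(values, window_size, axis=0)
    # Move the window axis to the middle so each item is (window_size, F)
    return windows.transpose(0, 2, 1)


# Toy input: 6 rows, 2 feature columns
values = np.arange(12, dtype=np.float32).reshape(6, 2)
x = window_array(values, window_size=3)
print(x.shape)
print(x[0])
```

Window k covers rows k to k + window_size - 1, so its matching target is row k + window_size - 1, the same pairing the loop above uses. The result is a read-only view of the input; call .copy() on it if it needs to be modified.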
Upvotes: 1
Reputation: 1634
You can do this by writing a custom parsing function with TensorFlow:
import tensorflow as tf

file_path = "data.csv"
dataset = tf.data.TextLineDataset(file_path).skip(1)

def parse_csv_line(line):
    # Split the line into a list of strings
    fields = tf.io.decode_csv(line, record_defaults=[[""]] * 4)
    f1 = tf.strings.to_number(fields[0], tf.float32)
    f2 = tf.strings.to_number(fields[1], tf.float32)
    f3 = tf.strings.to_number(tf.strings.split(fields[2], ";"), tf.float32)
    label = tf.strings.to_number(fields[3], tf.int32)
    return {"f1": f1, "f2": f2, "f3": f3, "label": label}

dataset = dataset.map(parse_csv_line).batch(5)
next(iter(dataset.take(1)))
{'f1': <tf.Tensor: shape=(3,), dtype=float32, numpy=array([0.1, 0.2, 0.3], dtype=float32)>,
'f2': <tf.Tensor: shape=(3,), dtype=float32, numpy=array([0.2, 0.3, 0.4], dtype=float32)>,
'f3': <tf.Tensor: shape=(3, 6), dtype=float32, numpy=
array([[0.1, 0.2, 0.3, 1.1, 1.2, 1.3],
[0.2, 0.3, 0.4, 1.2, 1.3, 1.4],
[0.3, 0.4, 0.5, 1.3, 1.4, 1.5]], dtype=float32)>,
'label': <tf.Tensor: shape=(3,), dtype=int32, numpy=array([1, 0, 1], dtype=int32)>}
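The output above keeps f3 flat, with shape (3, 6), while the question asks for nested rows and for lengths that vary across the dataset. One possible way to get there, sketched under assumptions: each inner row of f3 has a fixed, known length (INNER = 3 here, which is an assumption, not something the CSV encodes), and shorter records are zero-padded with padded_batch. The sample file, its shortened second row, and the name parse_line are made up for illustration.

```python
import os
import tempfile

import tensorflow as tf

# Hypothetical sample file in the question's layout; the second data row
# deliberately has fewer f3 values to show the variable-length case.
csv_text = (
    "f1, f2, f3, label\n"
    "0.1, 0.2, 0.1;0.2;0.3;1.1;1.2;1.3, 1\n"
    "0.2, 0.3, 0.2;0.3;0.4, 0\n"
)
csv_path = os.path.join(tempfile.mkdtemp(), "input.csv")
with open(csv_path, "w") as f:
    f.write(csv_text)

INNER = 3  # assumed length of each inner row of f3


def parse_line(line):
    fields = tf.io.decode_csv(line, record_defaults=[[""]] * 4)
    # Strip the spaces that follow the commas before converting
    fields = [tf.strings.strip(field) for field in fields]
    f1 = tf.strings.to_number(fields[0], tf.float32)
    f2 = tf.strings.to_number(fields[1], tf.float32)
    flat = tf.strings.to_number(tf.strings.split(fields[2], ";"), tf.float32)
    # (rows, INNER); the number of rows may differ from record to record
    f3 = tf.reshape(flat, [-1, INNER])
    label = tf.strings.to_number(fields[3], tf.int32)
    return {"f1": f1, "f2": f2, "f3": f3}, label


dataset = (
    tf.data.TextLineDataset(csv_path)
    .skip(1)
    .map(parse_line)
    .padded_batch(2)  # pads each batch to its largest f3, filling with zeros
)

features, labels = next(iter(dataset))
print(features["f3"].shape)
print(features["f3"].numpy())
```

Padding changes the data (the short record gains a row of zeros), so a model consuming it would likely need masking. If padding is not acceptable, ragged tensors may be a better fit, but that is outside this sketch.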
Upvotes: 1