PyTorch Dataloader for multiple files with sliding window

I am working on a problem where I have multiple CSV files and need to read them one by one with a sliding window. For example, if one CSV file has 330 data points and the window size is 32, it should yield 10 windows (10 * 32 = 320 points), and the last 10 points are discarded.
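To make the arithmetic concrete, here is a minimal sketch of how the window boundaries for a single file could be computed. The helper name window_bounds is only illustrative, and it assumes non-overlapping windows, as in the 330-point example.

```python
def window_bounds(n_rows, window_size):
    """Return (start, stop) row offsets of every complete, non-overlapping window."""
    n_windows = n_rows // window_size  # incomplete trailing rows are dropped
    return [(k * window_size, (k + 1) * window_size) for k in range(n_windows)]


bounds = window_bounds(330, 32)
n_discarded = 330 - bounds[-1][1]  # rows left over after the last full window
print(len(bounds), bounds[-1], n_discarded)
```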

I started writing a dataset class, but after spending a lot of time on it I still cannot get it working. The current code looks like this:

class CustomDataset(Dataset):
    def __init__(self, data_folder, window_size):
        self.data_folder = data_folder
        self.data_file_list = [file for file in os.listdir(data_folder)]
        print(self.data_file_list)
        self.window_size = window_size

    def __len__(self):
        return len(self.data_file_list[0])

    def __getitem__(self, idx):
        filename = self.data_file_list[idx]
        data, label = read_file(filename)
        return data, label

    def read_file(self, filename):
        data = pd.read_csv(filename)
        data = data.drop(["file_name", "class_name"], axis = 1)
        features = data.drop(["class_no"], axis = 1)
        labels = data["class_no"]
        x = [features[index:index+self.window_size].values for index in range(0, len(features))]
        y = [labels[index:index+self.window_size].values for index in range(0, len(labels))]

        return x, y

Note: I can’t merge all these CSV files into one.

I am getting this error: TypeError: object of type 'type' has no len()

Upvotes: 0

Answers (1)

itzortzis

Reputation: 45

I propose the following workaround. In this approach, the __getitem__ function retrieves a specific window that belongs to a CSV file rather than the whole file. To that end, find_num_of_windows computes how many windows a given CSV file contains, and __len__ returns the total number of windows across all files. As a result, the idx passed to __getitem__ is no longer bounded by the number of files; its upper limit is the total number of windows. The create_dataset_dict function maps every possible idx value to the corresponding filename and window index.

Comments:

The code could be optimized, but I chose a simple approach so it is easier to understand.
I don't know exactly how the read_file function works, so I just wrote something as an example.

Hope it helps!

import csv
import os

from torch.utils.data import Dataset


class CustomDataset(Dataset):
    def __init__(self, data_folder, window_size):
        self.data_folder = data_folder
        # Store full paths (in a stable order) so the files can be opened later.
        self.data_file_list = [
            os.path.join(data_folder, name) for name in sorted(os.listdir(data_folder))
        ]
        self.window_size = window_size
        self.total_windows, self.dataset_dict = self.create_dataset_dict()

    def find_num_of_windows(self, path_to_file):
        # Count the data rows; this assumes the first row of each file is a header.
        with open(path_to_file, newline="") as csv_file:
            rows = sum(1 for _ in csv.reader(csv_file, delimiter=',')) - 1
        # Rows that do not fill a complete window are discarded.
        return max(rows, 0) // self.window_size

    def create_dataset_dict(self):
        # Map every global index to a file and to a window inside that file.
        idx = 0
        dataset_dict = {}
        for path in self.data_file_list:
            for j in range(self.find_num_of_windows(path)):
                dataset_dict[idx] = {
                    "filename": path,
                    "window_index": j
                }
                idx += 1

        return idx, dataset_dict

    def read_file(self, filename, w_index):
        # First and one-past-last data row of the requested window.
        start = w_index * self.window_size
        stop = start + self.window_size
        data = []
        labels = []
        with open(filename, newline="") as csv_file:
            csv_reader = csv.reader(csv_file, delimiter=',')
            next(csv_reader, None)  # skip the header row
            for line_idx, row in enumerate(csv_reader):
                if line_idx >= stop:
                    break
                if line_idx >= start:
                    # Placeholder columns: adapt to the real file layout.
                    # Values are strings here; convert them as needed.
                    data.append(row[0])
                    labels.append(row[1])

        return data, labels

    def __len__(self):
        return self.total_windows

    def __getitem__(self, idx):
        filename = self.dataset_dict[idx]["filename"]
        w_index = self.dataset_dict[idx]["window_index"]
        data, label = self.read_file(filename, w_index)
        return data, label
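As one possible optimization of read_file, the rows of a window can be sliced straight out of the reader with itertools.islice, so reading stops as soon as the window is complete. This is only a sketch: the function name read_window and the has_header flag are illustrative assumptions, and the temporary file exists purely for the demonstration.

```python
import csv
import os
import tempfile
from itertools import islice


def read_window(path, window_index, window_size, has_header=True):
    """Return the rows of one non-overlapping window as lists of strings."""
    start = window_index * window_size + (1 if has_header else 0)
    with open(path, newline="") as csv_file:
        reader = csv.reader(csv_file)
        # islice skips `start` rows, then yields at most `window_size` rows.
        return list(islice(reader, start, start + window_size))


# Demonstration on a throwaway file: one header row plus 10 data rows.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.csv")
    with open(demo_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["feature", "class_no"])
        writer.writerows([[i, i % 2] for i in range(10)])
    second_window = read_window(demo_path, window_index=1, window_size=3)

print(second_window)
```

The file is still read from the beginning on every call, so for large files it may be worth caching parsed files or storing byte offsets instead.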

Example:

Assuming we have 3 CSV files - csv_1, csv_2 and csv_3

From csv_1 we can extract 2 windows, from csv_2 we can extract 3 windows and from csv_3 we can extract 1 window.

Then:

find_num_of_windows(path_to_csv_1) returns 2
find_num_of_windows(path_to_csv_2) returns 3
find_num_of_windows(path_to_csv_3) returns 1

Calling create_dataset_dict() builds the dataset dictionary. Because PyTorch indexes datasets starting from 0, both the keys and the window indices are zero-based, so the dictionary looks like the one below:

{
  0: {
       "filename": path_to_csv_1,
       "window_index": 0
     },
  1: {
       "filename": path_to_csv_1,
       "window_index": 1
     },
  2: {
       "filename": path_to_csv_2,
       "window_index": 0
     },
  3: {
       "filename": path_to_csv_2,
       "window_index": 1
     },
  4: {
       "filename": path_to_csv_2,
       "window_index": 2
     },
  5: {
       "filename": path_to_csv_3,
       "window_index": 0
     }
}

If we now call __getitem__ with an idx in [0, 1, 2, 3, 4, 5], we can retrieve the corresponding window using the aforementioned dictionary. For example, if we pass 3 to __getitem__, we retrieve the second window (window_index 1) of file csv_2.
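The same lookup can also be sketched without a dictionary, by keeping only the cumulative window counts and doing a binary search. This is an alternative to the approach above, not part of it: the names locate and cumulative are hypothetical, and indices are zero-based, as PyTorch's samplers produce them.

```python
from bisect import bisect_right
from itertools import accumulate


def locate(idx, cumulative):
    """Map a global window index to (file position, window index in that file)."""
    if not 0 <= idx < cumulative[-1]:
        raise IndexError(idx)
    # The first cumulative count strictly greater than idx marks the file.
    file_pos = bisect_right(cumulative, idx)
    first_idx_of_file = cumulative[file_pos - 1] if file_pos > 0 else 0
    return file_pos, idx - first_idx_of_file


windows_per_file = [2, 3, 1]  # csv_1, csv_2, csv_3 from the example
cumulative = list(accumulate(windows_per_file))  # running totals: 2, 5, 6
lookup = [locate(i, cumulative) for i in range(cumulative[-1])]
print(lookup)
```

This keeps one integer per file instead of one dictionary entry per window, which may matter when there are many windows.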

Upvotes: 1
