Vendetaheist23

Reputation: 41

How to create a neural network that takes in an image and outputs another image?

I'm trying to create a neural network that takes the L channel of an image (from the Lab color space) as input and outputs the ab channels. I can pass the L channel in without an issue, but I'm having trouble figuring out how to output the ab channels. The output should be of shape 1x2xHxW, where H and W are the height and width of the input image. Here is my network so far:

import torch
import torch.nn as nn
from torchvision import models

class Model(nn.Module):
    def __init__(self):
        super(Model, self).__init__()
        # Get the resnet18 model from the torchvision.models library
        self.model = models.resnet18(pretrained=True)

        # Accept a single-channel (L) input instead of 3-channel RGB
        self.model.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)

        # Note: ResNet has no .classifier attribute, so this line is a no-op;
        # the original fc layer is kept and still outputs a 1000-feature vector
        self.model.classifier = nn.Sequential()

        # Add custom classifier layers
        self.fc1 = nn.Linear(1000, 1024)
        self.Dropout1 = nn.Dropout()
        self.PRelU1 = nn.PReLU()

        self.fc2 = nn.Linear(1024, 512)
        self.Dropout2 = nn.Dropout()
        self.PRelU2 = nn.PReLU()

        self.fc3 = nn.Linear(512, 256)
        self.Dropout3 = nn.Dropout()
        self.PRelU3 = nn.PReLU()

        self.fc4 = nn.Linear(256, 313)
        # self.PRelU3 = nn.PReLU()
        
        self.softmax = nn.Softmax(dim=1)
        # Defined but not yet wired into forward(): a 1x1 conv to map the
        # 313 channels down to the 2 ab channels, plus a 4x upsampling step
        self.model_out = nn.Conv2d(313, 2, kernel_size=1, padding=0, dilation=1, stride=1, bias=False)
        self.upsample4 = nn.Upsample(scale_factor=4, mode='bilinear')


    def forward(self, x):
        # x is our input data
        x = self.model(x)
        x = self.Dropout1(self.PRelU1(self.fc1(x)))
        x = self.Dropout2(self.PRelU2(self.fc2(x)))
        x = self.Dropout3(self.PRelU3(self.fc3(x)))
        x = self.softmax(self.fc4(x))
        
        return x
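For example, a quick shape check of the model above (a sketch, assuming the imports shown and a 224x224 input) illustrates the mismatch:

model = Model()
L_channel = torch.randn(1, 1, 224, 224)   # a batch with one L-channel image
print(model(L_channel).shape)             # torch.Size([1, 313]) -- a flat vector,
                                          # not the (1, 2, H, W) ab output I need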

Upvotes: 0

Views: 1087

Answers (1)

Theodor Peifer

Reputation: 3496

I don't really know what you mean by "ab dimensions" and I'm not sure what the "L format" is, but I can tell you how to use CNNs to generate images.

Normally you would use an autoencoder, but that depends on the task. An autoencoder takes an image as input and, similar to a normal classification network, reduces its dimensions. But unlike in classification, you don't flatten the feature maps and add classification layers; instead you upsample and deconvolve them. So first you "encode" the image and then you "decode" it. The layer in the middle, right before you start upsampling, is called the bottleneck. No dense layers and no softmax activations are needed.

Here is an example of what this could look like as a PyTorch model (an autoencoder for the CIFAR-10 dataset):

import torch
import torch.nn as nn
import torch.nn.functional as F

class Model(nn.Module):
    def __init__(self):
        super(Model, self).__init__()

        """ encoder """
        self.conv1 = nn.Conv2d(3, 32, kernel_size=(5, 5))
        self.batchnorm1 = nn.BatchNorm2d(32)

        self.conv2 = nn.Conv2d(32, 64, kernel_size=(4, 4), stride=3)
        self.batchnorm2 = nn.BatchNorm2d(64)

        self.conv3 = nn.Conv2d(64, 128, kernel_size=(3, 3), stride=3)
        self.batchnorm3 = nn.BatchNorm2d(128)

        self.maxpool2x2 = nn.MaxPool2d(2)   # not in usage

        """ decoder """
        self.upsample2x2 = nn.Upsample(scale_factor=2)   # not in usage

        # the decoder batchnorms need their own names; reusing batchnorm1-3
        # would overwrite the encoder layers defined above
        self.deconv1 = nn.ConvTranspose2d(128, 64, kernel_size=(3, 3), stride=3)
        self.batchnorm4 = nn.BatchNorm2d(64)

        self.deconv2 = nn.ConvTranspose2d(64, 32, kernel_size=(4, 4), stride=3)
        self.batchnorm5 = nn.BatchNorm2d(32)

        self.deconv3 = nn.ConvTranspose2d(32, 3, kernel_size=(5, 5))
        self.batchnorm6 = nn.BatchNorm2d(3)   # not used in forward
    

    def forward(self, x, return_bottlenecks: bool=False):

        """ encoder """
        x = self.conv1(x)
        x = self.batchnorm1(x)
        x = F.relu(x)

        x = self.conv2(x)
        x = self.batchnorm2(x)
        x = F.relu(x)

        x = self.conv3(x)
        x = self.batchnorm3(x)
        bottlenecks = F.relu(x)

        """ decoder """
        x = self.deconv1(bottlenecks)
        x = self.batchnorm1(x)
        x = F.relu(x)

        x = self.deconv2(x)
        x = self.batchnorm5(x)
        x = F.relu(x)

        x = self.deconv3(x)
        x = torch.sigmoid(x)

        return x

In this example I don't use maxpool and upsample, but that depends on your model. Upsample is basically the opposite of maxpool, and you can think of ConvTranspose2d as roughly the opposite of a convolution (even though that wouldn't really be the right explanation).
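Here is a minimal shape check (my sketch, assuming standard torch imports) showing how Upsample undoes MaxPool2d and how ConvTranspose2d restores the spatial size that a Conv2d with the same kernel size and stride reduced:

import torch
import torch.nn as nn

x = torch.randn(1, 16, 8, 8)                       # dummy feature map
pooled = nn.MaxPool2d(2)(x)                        # (1, 16, 4, 4): halves H and W
up = nn.Upsample(scale_factor=2)(pooled)           # (1, 16, 8, 8): back to the original size

y = torch.randn(1, 16, 9, 9)
conv = nn.Conv2d(16, 32, kernel_size=3, stride=3)
deconv = nn.ConvTranspose2d(32, 16, kernel_size=3, stride=3)
down = conv(y)                                     # (1, 32, 3, 3)
restored = deconv(down)                            # (1, 16, 9, 9): spatial size recovered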

So you basically want the "decoder" part to be the opposite (or a mirrored version) of the "encoder" part. Figuring out the dimensions, kernel sizes etc. for each layer can be quite tricky, but you basically have to set them so that the architecture is almost symmetrical and the output dimensions of the model match the size of the image you want to produce.
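As a quick sanity check (a sketch, assuming the imports and the Model class from the block above), you can verify that the output has the same size as the input:

model = Model()
batch = torch.randn(4, 3, 32, 32)   # four RGB images, 32x32 like CIFAR-10
out = model(batch)
print(out.shape)                    # torch.Size([4, 3, 32, 32]) -- matches the input size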

That's what an "image producing" architecture could look like:

(figure: an encoder-decoder architecture that narrows to a bottleneck and then widens back out to the output image size)

source: https://www.semanticscholar.org/paper/Feature-discovery-and-visualization-of-robot-data-Flaspohler-Roy/514a2f7461edd3e4c2d56d57f9002e1dc445eb58/figure/1

Upvotes: 1
