
AI Architect - Advanced deep learning settings

The AI Architect allows the user to control deep learning settings with respect to Input and Sampling, Context-aware region mapping, Network Parameters, Training Parameters, and Data Augmentation. The AI Architect is opened by pressing the Advanced Deep Learning settings button.

Input and Sampling

In Inputs, you define the inputs of the neural network. This means that you can build networks for brightfield, fluorescence, and IMC images. You can also change the image channel if you have aligned WSS images (even if you have an already trained network). You can even set up multiple cross-channel inputs, sending all information directly to the network, e.g. if you have aligned a brightfield image and a fluorescence image. By default, all bands from the first image channel are selected as inputs.

Context-aware region mapping

Context Aware Region Mapping is a special tool for anatomical region segmentation (e.g. brain and spinal cord regions, kidney regions, etc.), where tissue contrast is not always sufficient for accurate segmentation and contextual information from the entire tissue section is needed. Thanks to its adaptive-resolution capabilities, it can also be useful for segmentation tasks where the regions differ in size from tissue to tissue, e.g. different sections throughout the brain volume.

Example of annotations that can be used for training a Deep Learning classifier with Context Aware Region Mapping. The ROI outlines relevant tissue with the context needed to segment the areas of interest.

Activating Context Aware Region Mapping changes the sampling strategy during training of the deep learning network from labels to ROI(s). The magnification, which is normally fixed, becomes adaptive: the optimal magnification is selected for each specified ROI. Context Aware Region Mapping therefore enables multiple regions to be trained at multiple resolutions (adaptive resolution).

Context Aware Region Mapping can be activated for all Deep Learning classifiers. However, DeepLabv3+ is generally recommended due to its global context capabilities. To train a deep learning classifier with Context Aware Region Mapping, please follow the instructions in the manual Training a deep learning classifier. Because the ROI now serves as the sampling object, make sure to include an ROI that outlines the tissue section and covers all annotated labels (the ROI would normally be the outline of your tissue).

Context Aware Region Mapping can be enabled by going into the Advanced Deep Learning settings and toggling on Context Aware Region Mapping (Adaptive Resolution), as illustrated below.

note
  • Because the resolution is adaptive, make sure to checkmark "Treat Regions Individually" when running the APP on images with multiple tissue sections (ROIs). Otherwise the FOV can expand to fit all ROIs within one FOV, i.e. the samples are not analyzed one by one.
  • It is still necessary to give examples of the background class using label 001.
  • If you switch Deep Learning classifier, make sure to re-activate Context Aware Region Mapping in the AI Architect.

Input Size (Pixels). Use the input size to control the size (width and height in pixels) of each image sent through the network. Depending on the problem you are trying to solve, you need to consider the receptive field in the tissue, i.e. how much of the tissue is available to the neural network. You can control the receptive field by adjusting the magnification and/or the input size. Generally, for problems that require very local information (e.g. nuclei or membrane segmentation), you can select 512 or 256 pixels, because only the pixels close to each nucleus are required. On the other hand, larger structures (e.g. metastases or vessels) benefit from contextual information, i.e. the more context included in each prediction, the better. The input size can also be viewed by toggling the show preview check-box: the green box with the cross in the middle is the physical size of the input.
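As a back-of-the-envelope check of the receptive field, the physical area covered by one input patch can be estimated from the input size and the pixel size at a given magnification. The microns-per-pixel values below are typical scanner figures used purely for illustration, not Visiopharm calibration data:

```python
# Illustrative microns-per-pixel values for common scanner magnifications.
# These are typical figures, not Visiopharm calibration data.
MICRONS_PER_PIXEL = {40: 0.25, 20: 0.5, 10: 1.0, 5: 2.0}

def field_of_view_um(input_size_px: int, magnification: int) -> float:
    """Physical width/height (in microns) covered by one input patch."""
    return input_size_px * MICRONS_PER_PIXEL[magnification]

# A 512 px patch at 20x covers about 256 um -- plenty of local context for
# nuclei, while large structures may need lower magnification for more context.
print(field_of_view_um(512, 20))  # 256.0
print(field_of_view_um(512, 5))   # 1024.0
```

Comparing these numbers against the typical size of the structure you want to segment is a quick way to decide whether to grow the input size or lower the magnification.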

warning

After training, the input size cannot be changed unless the trained network is discarded.

The input size of a neural network and the APPs' FOV Size are not the same.


One important aspect to be aware of when selecting the inputs is the automatic weight initialization used in Visiopharm. Recall that the network weights are its parameters, which are updated during training. When you create a completely new APP and use 3 or fewer inputs (e.g. RGB from a single brightfield image), Visiopharm takes advantage of a pre-trained network, trained on a large dataset, that expects exactly 3 inputs. This means that all weights in the encoder are initialized with pretrained weights. By doing so, Visiopharm gives your network a so-called head start: you need less annotated data to obtain good results much faster, and the network is less prone to overfitting. If only 1 input is chosen, Visiopharm duplicates that input twice and feeds the result into the pre-trained network, thereby fulfilling its 3-input requirement. When 2 inputs are chosen, Visiopharm creates the third input as the mean of the first 2 inputs.
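The 1- and 2-input expansion described above can be sketched as follows; `expand_to_three_channels` is a hypothetical helper illustrating the idea, not Visiopharm's internal code:

```python
import numpy as np

def expand_to_three_channels(x: np.ndarray) -> np.ndarray:
    """Sketch of the 1- and 2-input expansion described above.

    x has shape (H, W, C) with C in {1, 2, 3}. A single input is
    duplicated twice; with two inputs, the third channel is the mean of
    the first two, so a pretrained 3-input encoder can still be used.
    """
    channels = x.shape[-1]
    if channels == 3:
        return x
    if channels == 1:
        return np.repeat(x, 3, axis=-1)
    if channels == 2:
        third = x.mean(axis=-1, keepdims=True)
        return np.concatenate([x, third], axis=-1)
    raise ValueError("pretrained initialization covers at most 3 inputs")

print(expand_to_three_channels(np.ones((4, 4, 1))).shape)  # (4, 4, 3)
```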

When you create a completely new APP and use more than 3 inputs (e.g. from a single fluorescence image), we cannot take advantage of a pretrained network as it expects exactly 3 inputs. This means that all weights are initialized with random values which follow a certain distribution. When the weights are not pretrained, you need to provide more annotated training data, and train for more iterations.

The inputs are preprocessed, which stabilizes and improves the training of the neural networks. One of the preprocessing steps is mean subtraction per input dimension, which is applied to RGB images and involves per-input subtraction of the mean values across the input dimension:

x' = x - mean(x)

where x is the original value, mean(x) is the per-input mean, and x' is the new value.

For non-RGB images (e.g. fluorescence images), min-max normalization per input dimension is applied, to normalize different inputs to approximately the same scale:

x' = \frac{x - min(x)}{max(x) - min(x)}

where x is the original value, min(x) and max(x) are the estimated minimum and maximum values calculated across the dataset, and x' is the transformed, normalized value.
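Both preprocessing steps can be sketched in a few lines; the helper names are hypothetical and the dataset-wide extremes are passed in explicitly, matching the description above:

```python
import numpy as np

def preprocess_rgb(x: np.ndarray) -> np.ndarray:
    """Mean subtraction per input dimension: x' = x - mean(x)."""
    return x - x.mean(axis=(0, 1), keepdims=True)

def preprocess_minmax(x: np.ndarray, lo: np.ndarray, hi: np.ndarray) -> np.ndarray:
    """Min-max normalization per input: x' = (x - min) / (max - min).

    lo and hi are per-channel extremes estimated across the dataset,
    not per image, as described above.
    """
    return (x - lo) / (hi - lo)

# A constant RGB patch maps to all zeros after mean subtraction:
print(preprocess_rgb(np.full((2, 2, 3), 7.0)).max())  # 0.0
```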

Once training is complete, new channels can be selected as inputs. This may be required if biomarkers are in a different channel position in other experiments.

For the training to be maintained, the number of input channels must remain the same. If the number of channels is changed, the training will be discarded and the APP will need to be re-trained.

Network Parameters

When examining network parameters, it is essential to delve into the notions of network depth and network width.

  1. Network depth: The number of building blocks in the encoder with each block containing one or more layers. E.g. the U-Net encoder has 4 blocks.
  2. Network width: The number of filters of each convolutional layer both in the encoder and decoder. E.g. the convolutional layers of the first block in the U-Net have 64 filters each.

Freeze Depth parameter. Freeze depth is the number of encoder blocks that are not updated during training. By freezing the early layers, their weights remain unchanged during training. This gives advanced fine-tuning capabilities when training is continued and new training data is added to an existing neural network. It is also a method for reducing overfitting of the neural network when you do not have a large annotated dataset, while also leading to faster training. In the earlier blocks of the CNN, the network learns to differentiate between simple abstract features, such as edges and curves, whilst the deeper blocks of the network combine these features into more complex ones.

How much to freeze when creating a new APP? For 3 inputs (e.g. RGB, but not limited to it), recall that Visiopharm automatically uses pre-trained weight initialization. The default freeze depth is therefore two or one, depending on the current network; we do not update the weights in the early layers, as these are found to be very good general feature extractors. If you have a large annotated training dataset, you can unfreeze all layers of the network (toward zero), as you do not need to worry about overfitting and the network can learn the best features possible. However, if you have limited annotated data, you can control overfitting by freezing one or more blocks.

Practical tips for freeze depth when fine-tuning an existing network. When fine-tuning an existing neural network, several aspects need to be considered. First, all network weights are initialized with the previously existing state of the network, i.e. the training can be continued. Depending on how different the new training data is from the original data, you can freeze more layers to make sure that you are only fine-tuning the highly specialized features deep in the network and not changing the early features too much. For example, if you experience new variation in a staining that was in the original training set, you can fine-tune only the last part of the network to learn the new variation without adjusting the weights too much. On the other hand, if you want to use an existing AI APP on a structural case that was not present in the original training data (i.e. more than a little new stain variation), you might have to freeze less of the network. You also need to consider the learning rate when fine-tuning, as this parameter controls how much the trainable weights are updated in each step/iteration.

Width Parameter. The network width controls the percentage of the original network size. Specifically, you control the percentage of original filters in each convolutional layer, so you can tailor the size of the network to your specific problem. If the network is too wide, it can overfit to the training data and lose the ability to generalize, which is not desirable. On the other hand, if the network is too thin, it might not have enough parameters to solve your problem. As a rule of thumb, the more difficult the problem, the more network parameters you need. By adjusting the width you can, therefore, control overfitting but also improve the training and execution speed of the neural network; a smaller network means faster evaluation. The percentage is approximate, as the exact number of filters is rounded up to the nearest multiple of 8, with a minimum of 8 filters, due to hardware optimization. Remember that if the width is changed after training, the training is deleted, because Visiopharm builds a new network based on your settings. If the width is 100%, Visiopharm uses all the original pretrained weights. However, when the width is less than 100%, a random subset of the original pretrained weights is subsampled for each convolutional layer. This means that you might need to set the freeze depth to zero to obtain the best possible results, as you thereby ensure that all layers are trained.
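The rounding rule can be illustrated as follows; `scaled_filters` is a hypothetical helper, and the exact rounding inside Visiopharm may differ in detail:

```python
import math

def scaled_filters(original_filters: int, width_pct: float) -> int:
    """Filters per layer after applying the width percentage: rounded up
    to the nearest multiple of 8, never below 8 (hardware optimization)."""
    scaled = original_filters * width_pct / 100.0
    return max(8, math.ceil(scaled / 8) * 8)

print(scaled_filters(64, 50))  # 32
print(scaled_filters(64, 90))  # 64  (57.6 rounds up to the next multiple of 8)
print(scaled_filters(64, 10))  # 8   (6.4 is lifted to the 8-filter minimum)
```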

Training Parameters

Learning rate (Adam optimization algorithm). The deep learning training algorithm uses the Adam Optimization algorithm for adjusting the weight parameters of the network. It is possible for the user to adjust the learning rate of the Adam algorithm.

This value determines the step size for each iteration. Having a too low learning rate will make the network converge very slowly as the values of the weights are changed by a very small amount. Having a too high learning rate, on the other hand, will make it difficult to properly converge at a local minimum and thus lead to a poor performing network.

When and how to adjust the learning rate

The learning rate is an important parameter for training and fine-tuning neural networks, as it can make the difference between excellent and poor classification results. When an ideal learning rate has been selected, the loss function decreases with increasing network iterations.

When to lower the learning rate:

  1. When the loss function is either fluctuating or rapidly increasing. If the step size for each iteration is too large, the network might not find the local minima of the loss function and hence, cause fluctuations.
  2. When the loss function is not improving, and it is desirable to decrease it further. If the loss function is not improving, the network might be wandering around a saddle point (neither a maximum nor a minimum) while missing the local minimum of the loss function. Lowering the learning rate might solve this problem.

When to raise the learning rate:

  1. When the loss function is steadily decreasing but learning is slow. In some cases where the loss function is steadily decreasing but learning is too slow, increasing the learning rate will not sacrifice performance.
  2. When the model is overfitting. A low learning rate might cause the model to overfit, as the minima might be too sharp. Overfitting will lead to a poor generalization ability, which is not desirable when training neural networks. In this case, increasing the learning rate would increase the generalization ability of the network, as this would cause the model to converge to a wider minimum.
  3. When it is desirable to decrease the loss function further. In some cases, increasing the learning rate will allow the parameters to converge to a better minimum decreasing the loss function further.

Finding the ideal learning rate is currently a matter of trial and error. However, the Adam optimizer is less sensitive to the initial learning rate than other optimizers (e.g. SGD), so you should not need to try all available learning rates. Do not train on the entire dataset at first, as a learning rate can often be ruled out after just a few iterations (e.g. when the loss function diverges, fluctuates, or learning is too slow). An optimal learning rate is typically found by trying various powers of 10.

Mini-Batch Size. When training a neural network, the data is typically divided into smaller subsets of the entire dataset, where each subset is denoted a mini-batch. Each mini-batch is used to evaluate the gradient of the loss function and update the weights of the network being trained. If your training curve is very unstable/fluctuating, each update of the network is very noisy. This can be caused by the fact that each input looks very different. By increasing the mini-batch size, you average the contribution of every single input, thus smoothing the gradient and the updates to the network. A GPU with a lot of VRAM allows for increasing the mini-batch size; on smaller hardware, the mini-batch size should be lowered. If the mini-batch size is increased, consider also increasing the learning rate: a smoother gradient allows a larger step in each update of the network.
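The "larger batch, larger step" advice is often implemented as the linear-scaling heuristic sketched below. This is a common rule of thumb from the deep learning literature, not a documented Visiopharm formula:

```python
def scaled_learning_rate(base_lr: float, base_batch: int, new_batch: int) -> float:
    """Linear-scaling heuristic: a larger mini-batch yields a smoother
    gradient, so a proportionally larger step is often safe. A rule of
    thumb only -- always watch the loss curve after changing it."""
    return base_lr * new_batch / base_batch

# Quadrupling the mini-batch size suggests quadrupling the learning rate:
print(scaled_learning_rate(1e-4, 8, 32))  # 0.0004
```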

Loss Function. The loss function evaluates the performance of the network as a single scalar value and is the mathematical formula describing the problem the neural network should learn to solve. This means, however, that not all problems should be described by the same loss function. In general, the network learns to solve the problem by minimizing the loss function, i.e. the lower the value, the better it solves the problem at hand. Visiopharm includes 3 different loss functions:

  1. Cross-entropy: Use when there are many regions that only include the background class and no foreground classes.
  2. Intersection-over-union (IoU): Use when a foreground class is frequently present and segmentation overlap is crucial.
  3. IoU + Cross-entropy: Special case.

Cross-entropy loss is the most common loss function in deep learning and it evaluates how well a classification algorithm performs.

\mathrm{LOSS}_{\text{crossEntropy}} = -\sum_{i \,\in\, \text{classes}} y_{\text{label}_i} \cdot \log P_i

where y_{label} is the ground truth label and P is the predicted probability of the i-th class. This is calculated for each pixel and then averaged across the image.

Intersection over Union loss can be used to evaluate the overlap between two objects: in our case, the ground truth and the prediction.

LOSS_{IoU} = 1 - IoU, \qquad IoU = \frac{TP}{TP + FN + FP}

where TP, FN, and FP are the areas of true positives, false negatives, and false positives, respectively.

IoU + Cross-entropy loss combines the concepts of both IoU and cross-entropy, with the potential to improve both training behavior and performance. We found that this loss works well for classes with very different area sizes, such as brain regions, when Context Aware Region Mapping is enabled.

LOSS_{IoU+crossEntropy} = LOSS_{IoU} + LOSS_{crossEntropy}
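The three loss functions can be sketched in a few lines. These are illustrative implementations of the formulas above, not Visiopharm's internal code:

```python
import numpy as np

def cross_entropy(y_true: np.ndarray, p: np.ndarray, eps: float = 1e-7) -> float:
    """Per-pixel cross-entropy, averaged across the image.
    y_true and p have shape (H, W, n_classes); y_true is one-hot."""
    return float(-(y_true * np.log(p + eps)).sum(axis=-1).mean())

def iou_loss(tp: float, fn: float, fp: float) -> float:
    """LOSS_IoU = 1 - IoU, with IoU = TP / (TP + FN + FP)."""
    return 1.0 - tp / (tp + fn + fp)

def combined_loss(y_true, p, tp, fn, fp) -> float:
    """IoU + cross-entropy, the combination described above."""
    return iou_loss(tp, fn, fp) + cross_entropy(y_true, p)

# Half of the predicted area overlaps the ground truth:
print(iou_loss(tp=50.0, fn=25.0, fp=25.0))  # 0.5
```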

Loss Weighting. Loss weighting is only relevant for cross-entropy. Use it with caution and avoid extreme values. The loss weighting function is used by the network optimization algorithm. Class-specific loss weights can be defined below, where the optimal values are determined empirically, typically as a function of the inverse of each class's relative size. Class-constant weighting is a rebalancing strategy for imbalanced datasets. A dataset is imbalanced when the number of instances in each class is not represented equally. To rebalance the dataset, class-constant weighting incorporates the class weights into the cross-entropy loss function. Ultimately, this gives the minority class a higher weight, whilst the dominating class receives a lower weight.
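The inverse-relative-size starting point mentioned above can be sketched as follows; `inverse_frequency_weights` is a hypothetical helper, and the final values should still be tuned empirically, avoiding extremes:

```python
def inverse_frequency_weights(pixel_counts: dict) -> dict:
    """Class weights proportional to the inverse of each class's relative
    size, normalized so the mean weight is 1. A common starting point;
    tune empirically and avoid extreme values, as advised above."""
    total = sum(pixel_counts.values())
    raw = {cls: total / count for cls, count in pixel_counts.items()}
    mean_weight = sum(raw.values()) / len(raw)
    return {cls: weight / mean_weight for cls, weight in raw.items()}

# Background dominates 9:1, so the minority class gets the larger weight:
weights = inverse_frequency_weights({"background": 900, "tumor": 100})
print(weights["tumor"] > weights["background"])  # True
```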

Iterations. The numeric value in this field indicates how many iterations the training of a model should run before automatically stopping. The field is blank by default; leaving it blank means there is no automatic stopping.

Data Augmentation

The performance of CNNs (Convolutional Neural Networks) depends to a high degree on the amount of labeled training data, but also on the variability in the training data. Generally, deep CNNs aim to learn the variability of features that are significant for the classification while discarding the irrelevant features. In computer vision, this means that the label is most likely invariant to certain image transformations, and utilizing this knowledge to perform data augmentation has been shown to increase the performance of deep neural networks.

Data augmentation refers to the process of applying specific image transformations on the available data without altering the image label, e.g. distortions to pixel intensities or spatial transformations. If performed correctly, the CNN is trained on more data variability, which results in more robust models.

The specific approach to data augmentation in Visiopharm is based on sequential operations forming a pipeline, where user-defined operations are applied to the dataset created from an image. Every operation in the pipeline has, by default, a probability parameter of 0.5. This probability parameter determines the probability that the operation is applied to the image fed to the network (the receptive field).

The augmentation techniques in the Visiopharm pipeline contribute to color, spatial, and stain invariance. As the image passes through each step of the pipeline, a new image is returned, artificially creating more data and more variation than in the original labeled training data.

The types of augmentation available are: Rotate, Flip, Brightness, Contrast, Hue, Saturation, H&E staining, and HDAB staining.
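Conceptually, the pipeline applies each operation independently with its own probability. A minimal sketch, using a hypothetical flip operation as a stand-in for the real augmentation steps:

```python
import random

def make_pipeline(operations):
    """Sequential augmentation pipeline: each (probability, fn) step is
    applied independently with its own probability (default 0.5 in the
    AI Architect). The operations here are hypothetical stand-ins."""
    def apply(image, rng=random):
        for probability, fn in operations:
            if rng.random() < probability:
                image = fn(image)
        return image
    return apply

flip_top_bottom = lambda image: image[::-1]

# Probability 1.0 means the flip is always applied:
always_flip = make_pipeline([(1.0, flip_top_bottom)])
print(always_flip([[1, 2], [3, 4]]))  # [[3, 4], [1, 2]]
```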

Rotate

Rotation augments the original image by a certain spatial rotation. This aims to make the network invariant to rotations, which is usually appropriate in histopathology (the tissue can be placed on the glass slide in any orientation). Make sure that the transformations are actually valid if you are analyzing anatomical sections.

Flip

Flip is a popular augmentation technique that allows for horizontal and vertical invariance. Again, this is usually a valid transformation in histopathology. This step can be added as Left to Right or Top to Bottom.

Brightness

Brightness changes the brightness of each input randomly, by a factor drawn from a user-defined range, with the selected probability:

x' = x + f \cdot max(x), \quad f \in [-\delta, \delta]

where f is a random factor, \delta is the user-selected brightness value, and max(x) is the estimated maximum value of the input.

Contrast

Contrast changes the contrast of each input image by a random factor within a user-defined range, with the selected probability:

x' = (x - mean(x)) \cdot f + mean(x), \quad f \in [a, b]

where f is the random factor, and a and b are the lower and upper range values.
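The brightness and contrast formulas can be sketched directly. The helpers are hypothetical, and uniform sampling of the factor f is an assumption about the distribution:

```python
import random
import numpy as np

def augment_brightness(x: np.ndarray, delta: float, rng=random) -> np.ndarray:
    """x' = x + f * max(x), with f drawn uniformly from [-delta, delta]."""
    f = rng.uniform(-delta, delta)
    return x + f * x.max()

def augment_contrast(x: np.ndarray, a: float, b: float, rng=random) -> np.ndarray:
    """x' = (x - mean(x)) * f + mean(x), with f drawn uniformly from [a, b]."""
    f = rng.uniform(a, b)
    return (x - x.mean()) * f + x.mean()
```

Note that the contrast transform pivots around the image mean, so a factor above 1 stretches values away from the mean while a factor below 1 compresses them toward it.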

Hue

Hue changes the hue of the color in each input randomly within a user-defined range, with the selected probability. It first involves a transformation from RGB space to IHS space, followed by a perturbation of h:

h' = h + f \cdot 2\pi, \quad f \in [-\delta, \delta]

where f is the random factor and \delta is the selected hue value.

Saturation

Saturation changes the saturation of the color in each input within a user-defined range, with the selected probability. It first involves a transformation from RGB space to IHS space, followed by a perturbation of s:

s' = s + f, \quad f \in [a, b]

where f is the random factor, and a and b are the lower and upper range values.
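The hue and saturation perturbations (after the RGB-to-IHS transform, which is omitted here) can be sketched as follows; the helpers are hypothetical, and the uniform sampling, hue wrap-around, and [0, 1] saturation clipping are assumptions:

```python
import math
import random

def augment_hue(h: float, delta: float, rng=random) -> float:
    """h' = h + f * 2*pi, f in [-delta, delta], wrapped back to [0, 2*pi)."""
    f = rng.uniform(-delta, delta)
    return (h + f * 2.0 * math.pi) % (2.0 * math.pi)

def augment_saturation(s: float, a: float, b: float, rng=random) -> float:
    """s' = s + f, f in [a, b], clipped to the valid [0, 1] range."""
    f = rng.uniform(a, b)
    return min(1.0, max(0.0, s + f))
```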

H&E staining

This step uses a domain-specific data augmentation scheme that specifically augments H&E-stain intensities by randomly varying fixed H&E-stain vectors. Instead of randomly perturbing image colors, we deliberately augment the variability that is well-defined by the H&E-dyes. For digital image processing in histopathology, it is well-recognized that we can define fixed stain vectors for the H&E-colors, which is precisely what we take advantage of in our method. We use a color deconvolution method that uses optical density (OD) transform to project the image onto the H and E-stain vectors. You can control how much of each stain vector should be augmented, e.g. if you know that the eosin stain varies more than the hematoxylin stain in your data.