Dec. 3, 2019
Diagnose like a Radiologist. Attentionbased model on TensorFlow 2.0.
In the last few years, the attention mechanism has become a popular technique to improve the accuracy of neural networks, we will see what this mechanism is and how to implement the model from the paper. To implement this model we will use TensorFlow 2.0. This new version of TensorFlow is really good when we have to do multiple actions to train our models.
Attention mechanism
Some months ago I wrote a post about class activation maps, this technique shows us the most important areas of the input image that the neural network took into account to make the final prediction. We obtain these activation maps from the output feature maps of the layers of the neural network.
Activation maps. Image from the class activation maps post.
There are multiple ways to implement an attention mechanism, in fact, this technique comes from NLP models where we want the model to pay attention to the most important words in a phrase. We can use a trainable attention mechanism where the mechanism is trained while the network is trained, like in the Learn To Pay Attention paper where this mechanism is integrated inside the architecture and affects directly the network's outputs.
We can also use a trained model to obtain the activation maps and crop the most important area of the images and train a new model on these cropped images. This was the approach that the authors of the [Diagnose like a Radiologist: Attention Guided
Convolutional Neural Network for Thorax Disease
Classification](https://arxiv.org/abs/1801.09927) paper followed. The attention mechanism follows this idea of getting the activation maps from the layers of the neural network and use them to learn where the network should pay attention to improve the training and accuracy of the predictions.
The authors of the Diagnose like a Radiologist paper implemented a threebranch neural network, each branch is a convolutional neural network that resolves a multilabel classification problem. One of these networks presents the attention mechanism but before talking about these models we must know the problem that the paper is trying to solve.
Thorax Disease classification
In this paper the authors used the ChestXray14 dataset that contains 112,120 frontalview chest Xray images of 30,805 unique patients, each image is labeled with one or more types of 14 thorax diseases. In total, we have 15 classes, where the first class indicates if the patient is healthy or not.
Thorax diseases can happen in small localized areas, if we use the whole image to train a model, the model accuracy could be affected by the irrelevant areas of the image. However, some diseases like pneumonia have lesion areas distributed in the whole image, therefore we should take into account small and big lesion areas to perform a final prediction. To solve this problem, the authors used three models, a global model, a local model, and a fusion model.
Threebranch neural network
These models have the same aim, to predict correctly the diseases of the image, the global model takes as input the whole xray image from the ChestXray14 dataset that we call global image, the local model takes as input an image that we call local image, we obtain the local image using the activation maps from the global model, as I previously mentioned, these activation maps show the most important area of the image, and in this case the area contains the lesion of the disease. Due to this, the local model introduces the attention mechanism and the model can focus on the lesion area in a better way.
If we have a small lesion area the local model will perform better than the global model since the global model has to deal with irrelevant areas, whereas, if the lesion area is distributed in the whole image the local model is not good enough to make a good prediction, here the global model will perform better, for this reason, the fusion model combines the learning of these two models to perform a final strong prediction.
Threebranch neural network. Image from the Diagnose like a Radiologist: Attention Guided Convolutional Neural Network for Thorax Disease Classification paper.
We will check one by one these models and their architectures. I recommend you to follow this Jupyter Notebook to understand better the implementation.
Global Model
The global model uses as base network the ResNet50 or DenseNet121 architectures. In this case, we will implement the network using the DenseNet121 architecture since it's faster to train.
The following code is the implementation of this network:
def create_branch():
dense_model = DenseNet121(
weights="imagenet",
input_shape=image_size,
include_top=False)
feature_maps = dense_model.output
pooling_layer = layers.GlobalMaxPool2D()(feature_maps)
predictions = layers.Dense(num_classes, activation="sigmoid")(pooling_layer)
model = Model(inputs=dense_model.input, outputs=[predictions, feature_maps, pooling_layer])
weight_decay_value = 1e4
weight_decay = tf.keras.regularizers.l2(l=weight_decay_value)
for layer in model.layers:
layer.trainable = True
if hasattr(layer, 'kernel_regularizer'):
layer.kernel_regularizer = weight_decay
return model
Here, we can notice that we use the dense network pretrained on the imagenet dataset and our model has multiple outputs, these outputs are:

predictions: These are the predictions for each class. Since we have a multilabel classification problem we use the sigmoid function so each class has a value between 0 and 1 and the output is an array like [0, 0.45, 0.78, 0.23, 0.98].

feature_maps: These are the output feature maps of the last layer of the DenseNet model, we use these feature maps to get the activation maps and crop the global image to obtain the local image.

pooling_layer: This pooling layer is used to train the fusion model and get a prediction based on the decisions of the global and local models.
Finally, we use L2 or weight decay regularization in each convolutional layer to avoid overfitting.
Local Model
The local model uses the same architecture than the global model but these two models don't share weights and are trained on different iterations. We use the same create_branch function to create the global and local models, however, in the local model we ignore the feature maps since we don't need them.
Fusion model
The fusion model takes as input the last pooling layers of the global and local models:
def create_fusion_branch():
global_pool = layers.Input(shape=(1024))
local_pool = layers.Input(shape=(1024))
concat_layer = layers.concatenate([global_pool, local_pool])
model = layers.Dense(num_classes, activation="sigmoid")(concat_layer)
model = Model(inputs=[global_pool, local_pool], outputs=model)
return model
We concatenate these pooling layers and use a fully connected layer to obtain the predictions.
Training
To train these models we use the binary crossentropy loss function and the area under the curve metric:
def compute_loss(true_classes, pred_classes):
loss = tf.keras.losses.binary_crossentropy(true_classes, pred_classes)
return tf.reduce_mean(loss)
auc_metric = tf.keras.metrics.AUC()
def compute_metrics(true_classes, pred_classes):
auc_metric.reset_states()
auc_metric.update_state(true_classes, pred_classes)
auc_value = auc_metric.result()
return auc_value
We train each model in three stages:
Stage 1
Firstly, we train the global model using a batch size of 64:
@tf.function
def train_stage_1(batch_images, labels):
with tf.GradientTape() as tape:
pred_classes, _, _ = global_branch_model(batch_images, training=True)
loss = compute_loss(labels, pred_classes)
auc_value = compute_metrics(labels, pred_classes)
gradients = tape.gradient(loss, global_branch_model.trainable_variables)
global_branch_optimizer.apply_gradients(zip(gradients, global_branch_model.trainable_variables))
return loss, auc_value
The function above is computed for each batch multiples times during one epoch. This function updates the model's weights.
def train(epochs):
for epoch in range(epochs):
batch_time = time.time()
step = 1
for batch_images, labels in train_generator:
loss, auc_value = train_stage_1(batch_images, labels)
auc_value = auc_value.numpy()
loss = loss.numpy()
print('\r', 'Epoch', f"{epoch + 1}/{epochs}", ' Step', f"{step}/{train_steps}", ' loss:', loss, " AUC:", auc_value, " time:", time.time()  batch_time, end='')
step += 1
stage_1_loss_results.append(loss)
stage_1_auc_results.append(auc_value)
batch_time = time.time()
checkpoint.save(file_prefix=checkpoint_prefix)
if (epoch + 1) % 20 == 0:
current_lr = tf.keras.backend.get_value(global_branch_optimizer.lr)
new_lr = current_lr / 10
tf.keras.backend.set_value(global_branch_optimizer.lr, new_lr)
print('\r', 'Epoch', f"{epoch + 1}/{epochs}", ' Step', f"{step}/{train_steps}", ' loss:', loss, " AUC:", auc_value, " time:", time.time()  start)
The train function executes the train_stage_1 function multiple times to train the model, this function prints the loss and the auc values after each batch, also after 20 epochs, we decrease the learning rate value as the paper mentions.
Stage 2
Once we have the global model trained, we use it to get the local images and train the local model on these images. In this step we only train the local model and we don't update the global model's weights.
Local Images
To get the local images, we need a function to compute the activation maps of the global model and crop the lesion area of the global images:
def get_local_images(feature_maps, global_images):
batch_size = len(feature_maps)
local_images = np.zeros((batch_size, 224, 224, 3), dtype=np.float32)
for index in range(batch_size):
feature_map = feature_maps[index, ...]
global_image = global_images[index, ...]
l1_distance = tf.norm(feature_map, ord=1, axis=2) # outputs 7x7
l1_distance = l1_distance  tf.keras.backend.min(l1_distance)
heatmap = l1_distance / tf.keras.backend.max(l1_distance)
resized_heatmap = tf.image.resize(tf.expand_dims(heatmap, 2), image_size[0:2], method=tf.image.ResizeMethod.BILINEAR)
binary_mask = tf.where(resized_heatmap[..., 0] > treshhold, 1, 0)
labeled_mask, labels_indexes = label(binary_mask, connectivity=2, background=0, return_num=True)
elements_by_label = [np.sum(labeled_mask == label_index) for label_index in range(1, labels_indexes + 1)]
if len(elements_by_label) == 0:
max_connected_region = (labeled_mask == 1)
else:
max_label_index = np.argmax(elements_by_label) + 1
max_connected_region = (labeled_mask == max_label_index)
max_connected_region_stack = tf.stack([max_connected_region, max_connected_region, max_connected_region], axis=2)
masked_image = tf.cast(max_connected_region_stack, tf.float32) * global_image
mask_coords = np.argwhere(masked_image != 0)
min_y = min(mask_coords[:, 0])
min_x = min(mask_coords[:, 1])
max_y = max(mask_coords[:, 0])
max_x = max(mask_coords[:, 1])
local_image = global_image[min_y:max_y, min_x:max_x, :]
local_image = tf.image.resize(local_image, image_size[0:2], method=tf.image.ResizeMethod.BILINEAR)
local_images[index, ...] = local_image
return local_images
The function get_local_images will use the feature maps to compute the attention maps and crop the global images. Since we have multiple images we need a for loop.
In the paper, I found this part a little bit confusing so I searched how to implement the attention maps and I found out a pythorch implementation. The get_local_images function is a similar implementation but using TensorFlow.
To train the local model, we use a batch size of 32, therefore we need 32 feature maps and 32 global images. We will compute the attention maps one by one:
feature_map = feature_maps[index, ...]
global_image = global_images[index, ...]
The paper indicates that we can compute the l1 distance of the feature maps and then count the maximum values to get a heatmap:
l1_distance = tf.norm(feature_map, ord=1, axis=2)
l1_distance = l1_distance  tf.keras.backend.min(l1_distance)
heatmap = l1_distance / tf.keras.backend.max(l1_distance)
The size of each feature map is 7x7x1024, we compute the l1 distance across the channel axis (axis 2) and we end up with a feature map of size 7x7, then we subtract the minimum value of this feature map to the feature map itself, I am not sure about the reason behind the subtraction, however, if we do this we obtain a better heatmap. Finally, we divide the feature map by the maximum value to count the maximum values of the feature map to get our heatmap. We can see this heatmap as a grayscale image where each pixel has values between 0 and 1.
Our heatmap has a size of 7x7, we have to resize this heatmap to the same size as our input image (224, 224), we use a bilinear method where each new pixel will have a similar value than the near pixels but not the same.
resized_heatmap = tf.image.resize(tf.expand_dims(heatmap, 2), image_size[0:2], method=tf.image.ResizeMethod.BILINEAR)
We will compute a binary mask from our heatmap. If some pixel has a value greater than 0.7 we assign 1 to that pixel, otherwise we assign 0.
binary_mask = tf.where(resized_heatmap[..., 0] > treshhold, 1, 0)
The final binary mask looks like:
Where the yellow areas are ones and the purple areas are zeros.
To get the lesion area, we have to find out the maximum connected region of this binary mask, in other words we have to find out the biggest yellow area. To achieve this we will use the label function from skimage that labels connected pixels:
labeled_mask, labels_indexes = label(binary_mask, connectivity=2, background=0, return_num=True)
Here the label function considers the zero values as background and the connectivity indicates that a pixel must have at least 2 pixels close to another pixel to consider them of the same label. This function returns the number of areas labeled and the following labeled_mask where we can notice that now each area has a different color:
Now we have to count the number of pixels that each area has and, the label with more pixels is the maximum connected region:
elements_by_label = [np.sum(labeled_mask == label_index) for label_index in range(1, labels_indexes + 1)]
We also have to check if the mask was labeled or if the label function couldn't find any connected region:
if len(elements_by_label) == 0:
max_connected_region = (labeled_mask == 1)
else:
max_label_index = np.argmax(elements_by_label) + 1
max_connected_region = (labeled_mask == max_label_index)
If the mask is labeled, then we keep the maximum connected region:
maximum connected region:
This yellow area is the maximum connected region that contains the lesion area. If we multiply the global image by this connected region, we only keep the pixels of the image that belong to the lesion area:
max_connected_region_stack = tf.stack([max_connected_region, max_connected_region, max_connected_region], axis=2)
masked_image = tf.cast(max_connected_region_stack, tf.float32) * global_image
However, we need a square area to crop the global image correctly. For this reason, we need to find out the coordinates where the pixels are not zero:
mask_coords = np.argwhere(masked_image != 0)
np.argwhere
returns the indices of the elements that are nonzero, using these indices we find the minimum values to get the min_y and min_x coordinates and the maximum values to get the max_y and max_x coordinates:
min_y = min(mask_coords[:, 0])
min_x = min(mask_coords[:, 1])
max_y = max(mask_coords[:, 0])
max_x = max(mask_coords[:, 1])
With these coordinates, now we can crop the global image to obtain the local image:
local_image = global_image[min_y:max_y, min_x:max_x, :]
Finally, we have to resize the local image to the input size of the network (224, 224, 3):
local_image = tf.image.resize(cropped_image, image_size[0:2], method=tf.image.ResizeMethod.BILINEAR)
In the example above, I used a random numpy array as feature maps to show how the function works. However, if we use real feature maps from the global model, the heatmap and binary mask are a little bit different:
In this case, the binary mask and labeled mask are the same since we only have one connected region and, therefore, one label..
The input of the get_local_images function are the feature maps from the global model and global images like the following:
And the output are the local images:
We can notice how the local images focus on some areas of the global image. I have to mention that I used a global model trained only for one epoch to obtain these local images, a model trained for more epochs should do a better job localizing the lesion areas.
Training Local Model
The following function is quite similar than the function we previously used to train the global model each batch:
@tf.function
def train_stage_2(local_images, labels):
with tf.GradientTape() as tape:
pred_classes, _, _ = local_branch_model(local_images, training=True)
loss = compute_loss(labels, pred_classes)
auc_value = compute_metrics(labels, pred_classes)
gradients = tape.gradient(loss, local_branch_model.trainable_variables)
local_branch_optimizer.apply_gradients(zip(gradients, local_branch_model.trainable_variables))
return loss, auc_value
The function to train our model each epoch is the following one:
def train(epochs):
for epoch in range(epochs):
batch_time = time.time()
epoch_time = time.time()
step = 1
for batch_images, labels in train_generator:
_, feature_maps, _ = global_branch_model(batch_images, training=False)
local_images = get_local_images(feature_maps, batch_images)
loss, auc_value = train_stage_2(local_images, labels)
auc_value = auc_value.numpy()
loss = loss.numpy()
print('\r', 'Epoch', f"{epoch + 1}/{epochs}", ' Step', f"{step}/{train_steps}", ' loss:', loss, " AUC:", auc_value, " time:", time.time()  batch_time, end='')
step += 1
stage_2_loss_results.append(loss)
stage_2_auc_results.append(auc_value)
batch_time = time.time()
if (epoch + 1) % 20 == 0:
current_lr = tf.keras.backend.get_value(local_branch_optimizer.lr)
new_lr = current_lr / 10
tf.keras.backend.set_value(local_branch_optimizer.lr, new_lr)
checkpoint_stage2.save(file_prefix=checkpoint_stage2_prefix)
print('\r', 'Epoch', f"{epoch + 1}/{epochs}", ' Step', f"{step}/{train_steps}", ' loss:', loss, " AUC:", auc_value, " time:", time.time()  epoch_time)
Here we use the feature maps from the global model (global_branch_model) to obtain the local images (get_local_images). Finally, we use these local images to train the local network (train_stage_2).
Stage 3
Now that we have the local and global models trained, we can train the fusion model:
@tf.function
def train_stage_3(global_pooling_layers, local_pooling_layers, labels):
with tf.GradientTape() as tape:
pred_classes = fusion_branch_model([global_pooling_layers, local_pooling_layers], training=True)
loss = compute_loss(labels, pred_classes)
auc_value = compute_metrics(labels, pred_classes)
gradients = tape.gradient(loss, fusion_branch_model.trainable_variables)
fusion_branch_optimizer.apply_gradients(zip(gradients, fusion_branch_model.trainable_variables))
return loss, auc_value
This model takes as input the pooling layers of the previous models and not the images.
def train(epochs):
for epoch in range(epochs):
batch_time = time.time()
epoch_time = time.time()
step = 1
for batch_images, labels in train_generator:
_, feature_maps, global_pooling_layers = global_branch_model(batch_images, training=False)
local_images = get_local_images(feature_maps, batch_images)
_, _, local_pooling_layers = local_branch_model(local_images, training=False)
loss, auc_value = train_stage_3(global_pooling_layers, local_pooling_layers, labels)
auc_value = auc_value.numpy()
loss = loss.numpy()
print('\r', 'Epoch', f"{epoch + 1}/{epochs}", ' Step', f"{step}/{train_steps}", ' loss:', loss, " AUC:", auc_value, " time:", time.time()  batch_time, end='')
step += 1
stage_3_loss_results.append(loss)
stage_3_auc_results.append(auc_value)
batch_time = time.time()
if (epoch + 1) % 20 == 0:
current_lr = tf.keras.backend.get_value(fusion_branch_optimizer.lr)
new_lr = current_lr / 10
tf.keras.backend.set_value(fusion_branch_optimizer.lr, new_lr)
checkpoint_stage3.save(file_prefix=checkpoint_stage3_prefix)
print('\r', 'Epoch', f"{epoch + 1}/{epochs}", ' Step', f"{step}/{train_steps}", ' loss:', loss, " AUC:", auc_value, " time:", time.time()  epoch_time)
We can see that we are using the global and local models to get the pooling layers and, at the same time, we are using the global model to get the local images.
The Notebook
In this post, we have seen some parts of the code to create the models and train them. However, I recommend checking the notebook to review the whole process. In the notebook, there is the code to get the dataset and create the generators.
If you want to test the get_local_images function or print the images using a tool like matplotlib then you must use the following code to see the images correctly:
imagenet_mean = np.array([103.939, 116.779, 123.68])
image = tf.math.add(image, imagenet_mean) / 255.0
I couldn't train the models on Google Colab since the process was taking a lot of time. I hope this code is useful if you want to implement this architecture and also if you want to check TensorFlow 2.0 and its workflow.
Identity Confounding
Before ending this post, I would like to talk about one more thing. Perhaps you have seen or read that medical data is not good enough to train machine learning or deep learning models. I am not an expert in this field but, I know that if we train models to diagnose diseases, we should be very careful since there are lives depending on the decisions of our model. Sometimes, we can get a really good accuracy value or even good sensitivity and specificity values, which are better metrics if our dataset is not balanced. However, the quality of our models depends in a big way on the quality of the dataset.
The authors of this paper used the ChestXray14 dataset, this paper is from January 2018 and the dataset is a little bit older, ChestXray14 is a dataset that contains incorrect labels for a lot of images, even when the average area under the curve across the 14 diseases is 87.1 on the test set, we can not say that this model is good to use in reallife environments. This good test set result is due to the consistent of the data, both sets are wrongly labeled, the model can still fit the data but in production, the predictions are not useful.
To train these models the authors split the dataset into training, validation and test sets randomly. Therefore, the xrays of a patient can appear in multiple sets, this introduces a problem called Identity Confounding where the model learns to identify the patients instead of the disease since the first tasks is easier than the last one, when we test the model on the validation or test sets the model that learned to identify patients, sees the same patient and knows if has some disease or not instead of identifying the disease from the image.
Identity Confounding is not the only problem that we can find in medical images, several problems can fool us and make us think that our model is good enough. These problems also cause difficulties to human doctors, to build a good dataset and a good model, researches must work along with human doctors to understand the workflow of the latter.
In this paper the authors replicate the workflow that radiologists follow, first the radiologist checks the whole image area to discover the lesion, once the lesion area has been identified, the radiologist will concentrate on this area, finally, the radiologist will consider both areas to make a final prediction. The idea of mimic the workflow of real doctors is really good to implement new architectures.