Training and Deploying an Image Captioning System

Model Diagram

Demo

Upload your own photo to be captioned:

I don't store your uploaded files anywhere.




For the rest of this post I walk through an end-to-end training of the captioning system in a reproducible Jupyter notebook style. The notebook was run on Google Colab on a high-RAM, GPU-accelerated runtime. All code for training and deployment is also available here.

Download the data from the COCO site.

In [1]:
!curl -o "annotations_trainval2014.zip" http://images.cocodataset.org/annotations/annotations_trainval2014.zip
!unzip "annotations_trainval2014.zip"
!curl -o "train2014.zip" http://images.cocodataset.org/zips/train2014.zip
!curl -o "val2014.zip" http://images.cocodataset.org/zips/val2014.zip
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  241M  100  241M    0     0  86.2M      0  0:00:02  0:00:02 --:--:-- 86.2M
Archive:  annotations_trainval2014.zip
  inflating: annotations/instances_train2014.json  
  inflating: annotations/instances_val2014.json  
  inflating: annotations/person_keypoints_train2014.json  
  inflating: annotations/person_keypoints_val2014.json  
  inflating: annotations/captions_train2014.json  
  inflating: annotations/captions_val2014.json  
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 12.5G  100 12.5G    0     0  39.6M      0  0:05:24  0:05:24 --:--:-- 89.1M
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 6337M  100 6337M    0     0  38.8M      0  0:02:42  0:02:42 --:--:-- 16.3M

Import the libraries needed for the initial data processing.

In [0]:
import matplotlib.pyplot as plt
from io import BytesIO
import cv2
import json
import numpy as np
import zipfile
import torchtext
import string
import torch

Now I iterate through all the pictures in the archives and resize them to 224x224 while preserving their aspect ratios (by padding with black pixels). I save each resized picture into a numpy array to use later when building the model.

In [0]:
base_path = "."
ds_to_fn = {'train':'train2014.zip','val':'val2014.zip'}
size = 224
In [0]:
def pad_image(img, height, width):
    ## pad the bottom and right edges with black pixels up to the target size
    h, w = img.shape[:2]
    t = 0
    b = height - h
    l = 0
    r = width - w
    return cv2.copyMakeBorder(img, t, b, l, r, 
                              cv2.BORDER_CONSTANT, value=0)

def resize_and_pad(img, height, width, resample=cv2.INTER_AREA):
    ## grayscale images get stacked into three identical channels
    if len(img.shape)==2:
        img = np.stack([img,img,img],axis=2)
    target_aspect_ratio = height/width
    im_h, im_w, _ = img.shape
    im_aspect_ratio = im_h/im_w
    ## scale the limiting dimension to the target and derive the other from the aspect ratio
    if im_aspect_ratio>target_aspect_ratio:
        target_height = height
        target_width = int(im_w * target_height/im_h)
    else:
        target_width = width
        target_height = int(im_h * target_width/im_w)
    resized = cv2.resize(img, (target_width, target_height),
                         interpolation=resample)
    return pad_image(resized, height, width)
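
As a quick sanity check of these helpers (an illustrative snippet, not one of the original cells), a dummy 480x640 image should come back as 224x224: scaled to 168x224 and then padded with black rows along the bottom.

## illustrative check of resize_and_pad on a dummy image
dummy = np.zeros((480, 640, 3), dtype=np.uint8)
out = resize_and_pad(dummy, size, size)
print(out.shape)  ## (224, 224, 3)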
In [0]:
pics, im_fn_to_index = {}, {}
for ds in ['train','val']:
    fn = ds_to_fn[ds]
    archive = zipfile.ZipFile(f"{base_path}/{fn}")
    file_list = archive.filelist
    ## one row per image file (the -1 accounts for the archive's directory entry)
    pics[ds] = np.zeros((len(file_list)-1,size,size,3),dtype=np.uint8)
    im_fn_to_index[ds] = {}
    for file_obj in file_list:
        im_fn = file_obj.filename
        if not im_fn.endswith('.jpg'):
            continue
        with archive.open(file_obj) as open_file:
            res = BytesIO(open_file.read())
            pic = plt.imread(res,'jpg')
        ## record which row of the array this filename maps to
        ind = len(im_fn_to_index[ds])
        pics[ds][ind] = resize_and_pad(pic, size, size)
        im_fn_to_index[ds][im_fn] = ind
    archive.close()

For word embeddings I use the pretrained GloVe embeddings that are downloadable through torchtext. I take only the 100k most common words, since less frequent words would add more dimensions to the output space without much benefit: most words in the captions are simple and therefore common. I also exclude any word containing a digit, punctuation other than a dash or apostrophe, or an uppercase letter, since I will only work with lowercase text.

In [6]:
vocab = torchtext.vocab.GloVe(name='840B', dim=300, max_vectors=100000)
.vector_cache/glove.840B.300d.zip: 2.18GB [16:54, 2.15MB/s]                           
  5%|▍         | 99408/2196017 [00:09<03:28, 10063.89it/s]
In [0]:
punctuation = set(c for c in string.punctuation if c not in "-'")
digits = set(str(i) for i in range(10))

## keep only lowercase words without digits or disallowed punctuation
inds_to_use = []
for i,word in enumerate(vocab.itos):
    if not any(c in punctuation or c in digits for c in word):
        if not all(c in "-'" for c in word) and word.islower():
            inds_to_use.append(i)

## restrict the tokens, lookup table, and embedding matrix to those words
vocab.itos = np.array(vocab.itos)[inds_to_use]
vocab.stoi = {s:i for i,s in enumerate(vocab.itos)}
vocab.vectors = vocab.vectors[inds_to_use]
In [8]:
## size of remaining vocab
len(vocab.stoi),len(vocab.itos),vocab.vectors.size()
Out[8]:
(41746, 41746, torch.Size([41746, 300]))

Install some libraries that I use for pretrained building blocks.

In [9]:
!pip install efficientnet_pytorch
!pip install transformers
Collecting efficientnet_pytorch
  Downloading https://files.pythonhosted.org/packages/b8/cb/0309a6e3d404862ae4bc017f89645cf150ac94c14c88ef81d215c8e52925/efficientnet_pytorch-0.6.3.tar.gz
Requirement already satisfied: torch in /usr/local/lib/python3.6/dist-packages (from efficientnet_pytorch) (1.5.0+cu101)
Requirement already satisfied: numpy in /usr/local/lib/python3.6/dist-packages (from torch->efficientnet_pytorch) (1.18.5)
Requirement already satisfied: future in /usr/local/lib/python3.6/dist-packages (from torch->efficientnet_pytorch) (0.16.0)
Building wheels for collected packages: efficientnet-pytorch
  Building wheel for efficientnet-pytorch (setup.py) ... done
  Created wheel for efficientnet-pytorch: filename=efficientnet_pytorch-0.6.3-cp36-none-any.whl size=12422 sha256=3c070e4d1baf1641c460b24655bb552213f1ffdf78f29b9fa830390a9ca7165d
  Stored in directory: /root/.cache/pip/wheels/42/1e/a9/2a578ba9ad04e776e80bf0f70d8a7f4c29ec0718b92d8f6ccd
Successfully built efficientnet-pytorch
Installing collected packages: efficientnet-pytorch
Successfully installed efficientnet-pytorch-0.6.3
Collecting transformers
  Downloading https://files.pythonhosted.org/packages/48/35/ad2c5b1b8f99feaaf9d7cdadaeef261f098c6e1a6a2935d4d07662a6b780/transformers-2.11.0-py3-none-any.whl (674kB)
     |████████████████████████████████| 675kB 3.5MB/s 
Requirement already satisfied: requests in /usr/local/lib/python3.6/dist-packages (from transformers) (2.23.0)
Requirement already satisfied: numpy in /usr/local/lib/python3.6/dist-packages (from transformers) (1.18.5)
Requirement already satisfied: filelock in /usr/local/lib/python3.6/dist-packages (from transformers) (3.0.12)
Requirement already satisfied: regex!=2019.12.17 in /usr/local/lib/python3.6/dist-packages (from transformers) (2019.12.20)
Collecting sentencepiece
  Downloading https://files.pythonhosted.org/packages/d4/a4/d0a884c4300004a78cca907a6ff9a5e9fe4f090f5d95ab341c53d28cbc58/sentencepiece-0.1.91-cp36-cp36m-manylinux1_x86_64.whl (1.1MB)
     |████████████████████████████████| 1.1MB 15.5MB/s 
Requirement already satisfied: tqdm>=4.27 in /usr/local/lib/python3.6/dist-packages (from transformers) (4.41.1)
Collecting tokenizers==0.7.0
  Downloading https://files.pythonhosted.org/packages/14/e5/a26eb4716523808bb0a799fcfdceb6ebf77a18169d9591b2f46a9adb87d9/tokenizers-0.7.0-cp36-cp36m-manylinux1_x86_64.whl (3.8MB)
     |████████████████████████████████| 3.8MB 24.4MB/s 
Collecting sacremoses
  Downloading https://files.pythonhosted.org/packages/7d/34/09d19aff26edcc8eb2a01bed8e98f13a1537005d31e95233fd48216eed10/sacremoses-0.0.43.tar.gz (883kB)
     |████████████████████████████████| 890kB 42.5MB/s 
Requirement already satisfied: packaging in /usr/local/lib/python3.6/dist-packages (from transformers) (20.4)
Requirement already satisfied: dataclasses; python_version < "3.7" in /usr/local/lib/python3.6/dist-packages (from transformers) (0.7)
Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /usr/local/lib/python3.6/dist-packages (from requests->transformers) (1.24.3)
Requirement already satisfied: idna<3,>=2.5 in /usr/local/lib/python3.6/dist-packages (from requests->transformers) (2.9)
Requirement already satisfied: chardet<4,>=3.0.2 in /usr/local/lib/python3.6/dist-packages (from requests->transformers) (3.0.4)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.6/dist-packages (from requests->transformers) (2020.4.5.1)
Requirement already satisfied: six in /usr/local/lib/python3.6/dist-packages (from sacremoses->transformers) (1.12.0)
Requirement already satisfied: click in /usr/local/lib/python3.6/dist-packages (from sacremoses->transformers) (7.1.2)
Requirement already satisfied: joblib in /usr/local/lib/python3.6/dist-packages (from sacremoses->transformers) (0.15.1)
Requirement already satisfied: pyparsing>=2.0.2 in /usr/local/lib/python3.6/dist-packages (from packaging->transformers) (2.4.7)
Building wheels for collected packages: sacremoses
  Building wheel for sacremoses (setup.py) ... done
  Created wheel for sacremoses: filename=sacremoses-0.0.43-cp36-none-any.whl size=893260 sha256=50458b0730a1d32307c4a7a67e21070b13d99080fd955c82a2827fc2d8c19c6d
  Stored in directory: /root/.cache/pip/wheels/29/3c/fd/7ce5c3f0666dab31a50123635e6fb5e19ceb42ce38d4e58f45
Successfully built sacremoses
Installing collected packages: sentencepiece, tokenizers, sacremoses, transformers
Successfully installed sacremoses-0.0.43 sentencepiece-0.1.91 tokenizers-0.7.0 transformers-2.11.0

Import everything I will need for defining and training the neural net.

In [0]:
import torch.nn as nn
import torchvision
import efficientnet_pytorch
import transformers
import scipy.stats


base_path = "."

Load the caption annotations and define all the hash maps that I will use later during training.

In [0]:
with open(f'{base_path}/annotations/captions_train2014.json','r') as f:
    annot_train = json.load(f)

with open(f'{base_path}/annotations/captions_val2014.json','r') as f:
    annot_val = json.load(f)
In [12]:
LONGEST_CAPTION = max(len(d['caption'].split())
                      for d in annot_train['annotations'] +\
                      annot_val['annotations'])
LONGEST_CAPTION
Out[12]:
50
In [0]:
train_pics = pics['train']
val_pics = pics['val']
train_immap = im_fn_to_index['train']
val_immap = im_fn_to_index['val']
In [0]:
train_fn_to_index = {key.split('/')[1]:val 
                     for key,val in train_immap.items()}
train_index_to_fn = {val:key for key,val in train_fn_to_index.items()}

val_fn_to_index = {key.split('/')[1]:val for key,val in val_immap.items()}
val_index_to_fn = {val:key for key,val in val_fn_to_index.items()}
In [0]:
train_imfn_to_imid = {d['file_name']:d['id'] 
                      for d in annot_train['images']}
train_imid_to_caption = {d['image_id']:d['caption']
                         for d in annot_train['annotations']}
train_imfn_to_caption = {fn:train_imid_to_caption[id_]
                         for fn,id_ in train_imfn_to_imid.items()}

val_imfn_to_imid = {d['file_name']:d['id'] for d in annot_val['images']}
val_imid_to_caption = {d['image_id']:d['caption']
                       for d in annot_val['annotations']}
val_imfn_to_caption = {fn:val_imid_to_caption[id_]
                       for fn,id_ in val_imfn_to_imid.items()}

Define a function to show training images together with the captions provided in the training annotations, then view 10 random images and their captions.

In [0]:
def show_im_and_cap_train(indexes):
    for index in indexes:
        imfn = train_index_to_fn[index]
        caption = train_imfn_to_caption[imfn]
        fig = plt.figure(figsize=(7,7))
        plt.imshow(train_pics[index])
        plt.title(caption)
        plt.show()
In [17]:
show_im_and_cap_train(np.random.randint(0,len(train_pics),10))

I add two special tokens to the vocabulary: an end token and an unknown token. The unknown token is needed so that out-of-vocabulary caption words still map to an embedding vector, and the end token is needed in the output space since the model must predict it to terminate a generated sequence.

In [0]:
UNK_TOK = '~~UNK~~'
vocab.itos = np.concatenate([[UNK_TOK],vocab.itos])
vocab.stoi = {v:k+1 for v,k in vocab.stoi.items()}
vocab.stoi[UNK_TOK] = 0 ## shifted to 1 once the end token is prepended below
vocab.vectors = torch.cat([torch.zeros(1,300),vocab.vectors])

END_TOK = '~~END~~'
vocab.itos = np.concatenate([[END_TOK],vocab.itos])
vocab.stoi = {v:k+1 for v,k in vocab.stoi.items()}
vocab.stoi[END_TOK] = 0
vocab.vectors = torch.cat([torch.zeros(1,300),vocab.vectors])
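
After both prepends the end token sits at index 0, the unknown token ends up at index 1, and every GloVe word shifts up by two. A quick consistency check (illustrative, not one of the original cells):

## the special tokens, lookup tables, and embedding matrix should stay in sync
assert vocab.stoi[END_TOK] == 0 and vocab.stoi[UNK_TOK] == 1
assert vocab.itos[vocab.stoi['dog']] == 'dog'     ## word lookups still round-trip
assert vocab.vectors.size(0) == len(vocab.itos)   ## one embedding row per token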

Next I save the captions as arrays of integers, which will be used to look up the embeddings. I also save an accompanying mask array, which determines which positions to backprop on: since the sequences are padded to the maximum length over all captions, much of the captions array is just padding, which is not useful for teaching the model.

In [0]:
train_captions = torch.zeros(train_pics.shape[0],
                             LONGEST_CAPTION+1,dtype=torch.long)
train_loss_mask = torch.zeros(train_pics.shape[0],
                              LONGEST_CAPTION+1,dtype=torch.bool)
for i in range(train_pics.shape[0]):
    caption = train_imfn_to_caption[train_index_to_fn[i]]
    split = caption.split()
    for word_ind, word in enumerate(split):
        word = word.lower().replace('.','').replace(',','').replace(';','')
        if not word:
            continue
        if word in vocab.stoi:
            train_captions[i, word_ind] = vocab.stoi[word]
        else:
            #print(word)
            train_captions[i, word_ind] = vocab.stoi[UNK_TOK]
        train_loss_mask[i, word_ind] = True
    train_captions[i, word_ind + 1] = vocab.stoi[END_TOK] ## cause it should generate end tok
    train_loss_mask[i, word_ind + 1] = True

val_captions = torch.zeros(val_pics.shape[0],
                           LONGEST_CAPTION+1,dtype=torch.long)
val_loss_mask = torch.zeros(val_pics.shape[0],
                            LONGEST_CAPTION+1,dtype=torch.bool)
for i in range(val_pics.shape[0]):
    caption = val_imfn_to_caption[val_index_to_fn[i]]
    split = caption.split()
    for word_ind, word in enumerate(split):
        word = word.lower().replace('.','').replace(',','').replace(';','')
        if not word:
            continue
        if word in vocab.stoi:
            val_captions[i, word_ind] = vocab.stoi[word]
        else:
            val_captions[i, word_ind] = vocab.stoi[UNK_TOK]
        val_loss_mask[i, word_ind] = True
    val_captions[i, word_ind + 1] = vocab.stoi[END_TOK]
    val_loss_mask[i, word_ind + 1] = True

I define a torch Dataset, the transforms for the images (the same transforms used to train EfficientNet, since I am using the pretrained efficientnet-b0 encoder, plus some added Gaussian noise to reduce overfitting), and dataloaders with a batch size of 84.

In [0]:
batch_size = 84

class Dataset(torch.utils.data.Dataset):
    def __init__(self, pics, captions, loss_mask, pic_transform):
        self.pics = pics
        self.captions = captions
        self.loss_mask = loss_mask
        self.pic_transform = pic_transform
    
    def __len__(self):
        return self.pics.shape[0]
    
    def __getitem__(self, idx):
        return {'pics':self.pic_transform(self.pics[idx]),
                'captions':self.captions[idx],
                'loss_mask':self.loss_mask[idx]}

class AddGaussianNoise():
    def __init__(self, mean=0., std=.25):
        self.std = std
        self.mean = mean

    def __call__(self, tensor):
        return tensor + torch.randn(tensor.size()) * self.std + self.mean

normalize = torchvision.transforms.Normalize(mean=[0.485, 0.456, 0.406], 
                                             std=[0.229, 0.224, 0.225])
train_transform = torchvision.transforms.Compose([
                                        torchvision.transforms.ToTensor(),
                                        normalize, AddGaussianNoise()
                                                 ])
train_ds = Dataset(train_pics, train_captions, 
                   train_loss_mask, train_transform)
train_dl = torch.utils.data.DataLoader(train_ds, 
                                       batch_size=batch_size, 
                                       shuffle=True)

val_transform = torchvision.transforms.Compose([
                                        torchvision.transforms.ToTensor(),
                                        normalize])
val_ds = Dataset(val_pics, val_captions, val_loss_mask, val_transform)
val_dl = torch.utils.data.DataLoader(val_ds,
                                     batch_size=batch_size, shuffle=False)
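
As a quick shape check on one batch (illustrative, not one of the original cells): the images come out as normalized 3x224x224 tensors, and the captions and loss mask are padded to LONGEST_CAPTION + 1 positions.

## illustrative shape check of a single training batch
batch = next(iter(train_dl))
print(batch['pics'].shape)       ## torch.Size([84, 3, 224, 224])
print(batch['captions'].shape)   ## torch.Size([84, 51])
print(batch['loss_mask'].shape)  ## torch.Size([84, 51])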

Next I define the RNN model. I load the pretrained EfficientNet encoder and use the embedding of the image (after average pooling) as the initial hidden state of the GRU. To predict a word from each hidden state I use a single linear layer with dropout in between. During inference I sample from the top 2 predictions at each step to generate 10 candidate sequences, then use a pretrained language model to pick the sentence with the highest predicted probability, which avoids some of the obvious grammatical issues that can happen with language generation.

In [0]:
def get_lm_score():
    tokenizer = transformers.GPT2Tokenizer.from_pretrained('distilgpt2')
    lm = transformers.GPT2LMHeadModel.from_pretrained('distilgpt2')
    lm.eval()
    for parameter in lm.parameters():
        parameter.requires_grad = False
    max_length = 86
    def lm_score(sents):
        ## sents should be a list of strings
        inds = torch.zeros(len(sents),max_length,dtype=torch.long)
        mask = torch.ones(len(sents),max_length,dtype=torch.float)
        for i in range(len(sents)):
            tok = tokenizer.encode_plus(sents[i], add_special_tokens=True, 
                                        return_tensors='pt',
                                max_length=max_length)['input_ids'][0]
            inds[i, :len(tok)] = tok
            mask[i, len(tok):] = 0
        logits = lm(inds)[0]
        inds_flattened = inds.flatten()
        indexer = torch.arange(0,inds_flattened.size(0),dtype=torch.long)
        chosen_words = logits.view(logits.size(0)*logits.size(1),-1)[indexer,inds_flattened]
        chosen_words = chosen_words.view(logits.size(0),logits.size(1))
        lm_scores = nn.functional.logsigmoid(chosen_words * mask).sum(1).numpy()
        lm_scores /= mask.sum(1).numpy()
        return lm_scores
    return lm_score
    
class Captioner(nn.Module):
    def __init__(self, vocab):
        super().__init__()
        self.vocab = vocab
        self.vocab_size = len(vocab.itos)
        self.word_emb_size = 300
        self.encoder = efficientnet_pytorch.EfficientNet.from_pretrained('efficientnet-b0')
        self.pic_emb_size = 1280
        self.average_pooling = nn.AdaptiveAvgPool2d(1)
        self.dropout = nn.Dropout(p=.2)
        self.decoder = nn.GRU(input_size=self.word_emb_size,
                              hidden_size=self.pic_emb_size,
                              batch_first=True)
        self.classifier = nn.Linear(self.pic_emb_size,self.vocab_size)
        self.start_tok_embed = nn.Parameter(torch.randn(self.word_emb_size,dtype=torch.float32))
        self.lm_score = get_lm_score()
        
    def forward(self, ims, caption_embs):
        bs = ims.size(0)

        im_embs = self.encoder.extract_features(ims)
        im_embs = self.average_pooling(im_embs).view(bs,self.pic_emb_size)
        hidden = im_embs.unsqueeze(0)

        ## prepend the learned start-token embedding so the first GRU step predicts the first word
        caption_embs = torch.cat([self.start_tok_embed.expand(bs,1,
                                  self.word_emb_size),caption_embs],
                                 axis=1)
        out, _ = self.decoder(caption_embs,hidden)
        out = self.dropout(out.reshape(bs*caption_embs.size(1),-1))
        out = self.classifier(out)
        out = out.view(bs,caption_embs.size(1),-1)
        return out

    def inference(self, im, device, num_sample=10, max_length=32, topk=2):
        with torch.no_grad():
            sents = []
            for it in range(num_sample):
                bs = 1
                ims = im.unsqueeze(0)
                im_embs = self.encoder.extract_features(ims)
                im_embs = self.average_pooling(im_embs).view(bs,self.pic_emb_size)
                hidden = im_embs.unsqueeze(0)
                word_emb = self.start_tok_embed.expand(bs,1,self.word_emb_size)
                preds = []
                for i in range(max_length):
                    _, hidden = self.decoder(word_emb, hidden)
                    pred = self.classifier(hidden.squeeze(0)).squeeze()
                    pred = nn.functional.softmax(pred,dim=0)
                    top_preds = torch.topk(pred,topk)
                    top_preds_inds = top_preds.indices.cpu().numpy()
                    top_preds_values = top_preds.values.cpu().numpy()
                    ## drop the unknown token (index 1) from the sampling candidates
                    top_preds_values = top_preds_values[top_preds_inds!=1]
                    top_preds_inds = top_preds_inds[top_preds_inds!=1]
                    top_preds_values = top_preds_values/top_preds_values.sum()
                    pred = np.random.choice(top_preds_inds,p=top_preds_values)
                    ## the end token (index 0) terminates the sequence
                    if pred==0:
                        break
                    word_emb = self.vocab.vectors[pred].view(bs,
                               1,self.word_emb_size).to(device)
                    preds.append(self.vocab.itos[pred])
                sents.append(' '.join(preds))
            scores = self.lm_score(sents)
            print(sents)
            return sents[np.argmax(scores)] 

def get_word_embs(vocab, word_inds):
    words = word_inds[:,:-1]
    size = words.size()
    return vocab.vectors[words.flatten()].view(size[0],size[1],300)
In [22]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model = Captioner(vocab)
model = model.to(device)
for param in model.encoder.parameters():
    param.requires_grad = False
model.encoder.eval()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)
Downloading: "https://github.com/lukemelas/EfficientNet-PyTorch/releases/download/1.0/efficientnet-b0-355c32eb.pth" to /root/.cache/torch/checkpoints/efficientnet-b0-355c32eb.pth
Loaded pretrained weights for efficientnet-b0




Finally I train the model for 7 epochs on the training set, monitoring cross-entropy loss on the train and val sets and running the inference algorithm on 3 randomly selected val photos each epoch.

In [23]:
for epoch in range(7):

    train_losses = []
    val_losses = []

    model.train()
    model.encoder.eval()

    for i,batch in enumerate(train_dl):
        pics = batch['pics'].to(device)
        caption_embs = get_word_embs(vocab, batch['captions']).to(device)
        loss_mask = batch['loss_mask'].to(device)
        preds = model(pics, caption_embs)[loss_mask]
        labels = batch['captions'].to(device)[loss_mask]
        loss = criterion(preds,labels)
        loss.backward()
        nn.utils.clip_grad_norm_(model.parameters(),2)
        optimizer.step()
        optimizer.zero_grad()
        train_losses.append(loss.item())
    
    model.eval()

    with torch.no_grad():
        for i,batch in enumerate(val_dl):
            pics = batch['pics'].to(device)
            caption_embs = get_word_embs(vocab, batch['captions']).to(device)
            loss_mask = batch['loss_mask'].to(device)
            preds = model(pics, caption_embs)[loss_mask]
            labels = batch['captions'].to(device)[loss_mask]
            loss = criterion(preds,labels)
            val_losses.append(loss.item())
    
    rand_val_exs = np.random.randint(0,len(val_ds),size=3)
    for idx in rand_val_exs:
        item = val_ds[idx]
        im = item['pics'].to(device)
        cap = model.inference(im, device)
        plt.imshow(val_pics[idx])
        plt.title(cap)
        plt.show()

    print(f"epoch: {epoch}, tr_loss: {np.mean(train_losses)}, "
          f"vl_loss: {np.mean(val_losses)}")
['a cat sitting on a bed in a room', 'a dog laying on the bed in the bed', 'a dog sitting on a bed with a dog laying on the bed', 'a dog sitting on a bed with two cats laying on a bed', 'a cat sitting on a bed with a dog laying on the bed', 'a cat laying on a bed with a dog', 'a dog sitting in a bed with a bed in the background', 'a cat sitting on a bed with a dog laying on the bed', 'a dog sitting on a bed with two cats laying in a bed', 'a dog laying down on the bed with a bed']
['a pizza that is sitting on the table with a pizza', 'a pizza with a pizza on top of a table with a knife', 'a pizza that has a slice of pizza on it', 'a pizza with pizza on a plate', 'a pizza with a slice of pizza and a knife', 'a plate of pizza with a pizza and a knife and a pizza', 'a pizza with pizza and a slice of pizza', 'a pizza with a slice of pizza on a table', 'a pizza with a pizza and cheese on a table', 'a pizza with a pizza and a pizza on it']
['a couple of people sitting on a table with a pizza', 'a man is sitting in a green and white plate', 'two people sitting in a white and white plate', 'a couple of pizza sitting on a table with a sandwich', 'a couple of people sitting in a white and white', 'a man is on a plate with a pizza on it', 'a couple of people sitting on a table with a sandwich on it', 'a man is sitting on a plate of food', 'a man is sitting in a green shirt and a white dog', 'a couple of people sitting on a table with a pizza']
epoch: 0, tr_loss: 3.8699993586685304, vl_loss: 3.153461958804239
['a large building that has a large clock tower', 'a large clock tower with a clock on it', 'a big building that has a clock tower on it', 'a large clock on the top of a city', 'a big clock tower in the middle of a building', 'a large building with a clock on top', 'a big clock on a city street with a clock on top of it', 'a large building that is on the top of a tower', 'a big building with a clock on the top of it', 'a big building with a large clock on top of it']
['a vase of a vase on a table with a vase of it', 'a small vase with a vase of it is on the table', 'the small table has a glass vase on the table', 'a vase with a flower on it on it', 'a small vase with a vase on it', 'a vase with a vase on it', 'a small vase of flowers on a table', 'a small vase of a vase with a vase of flowers', 'a small vase with a vase on it', 'a vase with a vase of flowers on it']
['a large airplane is flying over the water', 'a large plane flying over a blue sky', 'a large airplane is flying in the ocean with a blue sky', 'a person on the beach flying a kite', 'a large airplane is flying over a blue sky', 'a person on a beach flying a kite on a beach', 'the ocean is on a beach with a large body of water', 'a person flying a kite on a clear blue sky', 'a large airplane flying over the water on a beach', 'a large plane flying over the water on a clear blue sky']
epoch: 1, tr_loss: 2.9530043923588845, vl_loss: 2.899441791617352
['a person on a skateboard in the middle of a river', 'a man on a skateboard in the middle of a road', 'a man on the ground on a beach', 'a man is riding a bike on a sidewalk', 'a person on a surfboard on a beach', 'a man is riding a bike on a beach', 'a person on a skateboard on a city street near a body of water', 'a person on a surfboard on a beach', 'a man is on the beach with a dog on it', 'a man on a skateboard on a beach']
['a group of people playing frisbee in a park', 'a man in a park holding a frisbee', 'a man and a boy are playing with a frisbee', 'a man and a woman are playing frisbee in a park', 'a man and woman are playing frisbee in a park', 'a group of people playing frisbee in a park', 'a man and woman playing frisbee on a park', 'a man and a woman playing frisbee on a park', 'a man in the grass is holding a frisbee', 'a man and a woman playing frisbee in the park']
['two young men playing a game of tennis', 'a young boy is riding a skate board', 'a young boy is riding his skateboard in a field', 'a man is riding a skate board in the air on a skateboard', 'a man is riding a skate board on a ramp', 'a man in a red shirt is riding a skateboard', 'a young man in the middle of a skateboard on a skateboard', 'two young men playing tennis on a court', 'a man is riding a skateboard on a slope', 'a young man is riding a skateboard on the side of a mountain']
epoch: 2, tr_loss: 2.6968131365205403, vl_loss: 2.791331387948299
['a group of train tracks in a grassy area', 'a train is traveling down the tracks near a river', 'a group of train tracks in a grassy area', 'a train on the tracks with a train on the side', 'a train is traveling down a track near a bridge', 'the train is going down the tracks in the water', 'a train is going down a bridge with people in the background', 'a train is going through the track in the countryside', 'a train on a track with many people walking down', 'a train on the track with a train on the side of it']
['a couple of people standing next to a small child holding a teddy bear', 'a young boy and a woman holding a frisbee', 'a couple of men standing next to each other in a park', 'a couple of men and two girls sitting next to each other', 'a couple of men standing next to each other', 'two young boys and a young girl holding a baseball bat', 'a young girl holding a frisbee in front of a group of people', 'two men and a young girl sitting next to each other', 'a young boy and a young boy holding a frisbee', 'a couple of people standing next to a man holding a frisbee']
['several people sitting on a bench in the water', 'a group of boats are in the water', 'a group of boats are on the beach', 'several people sitting on a beach near a body of water', 'several people sitting on a beach with surfboards', 'a couple of boats are sitting on the beach', 'several boats on the water in a lake', 'a group of boats are in the water', 'a group of boats on the water near a body of water', 'a group of boats on the beach near the water']
epoch: 3, tr_loss: 2.521069379413829, vl_loss: 2.739822044629241
['a woman is playing a video with her cell phone', 'a woman is holding a cell phone in front of a tv', 'a woman holding up a cell phone while standing on a table', 'two men are playing a video game on a tv screen', 'a man is holding a video game controller in his hand', 'a man is holding a cell phone in his hand', 'a man holding a cell phone to his ear', 'a woman holding a cell phone to her ear', 'a man holding a cell phone to his ear', 'a woman holding a smart phone in front of a television screen']
['a man is flying a kite on a beach', 'a girl in the air with an orange kite', 'a man is holding a kite while another man watches', 'a girl in a red shirt and a kite flying in the air', 'a girl in a pink shirt and a pink kite', 'a man is flying a colorful kite on the beach', 'a man in a red shirt and a pink frisbee', 'a man is holding a frisbee in a field with people', 'a girl in the air with a kite in the sky', 'a girl in a red shirt and a pink kite flying in the sky']
['a plate of food with a salad salad on it', 'a plate of pasta with a salad and a salad', 'a bowl of noodles with broccoli and noodles', 'a bowl with a salad salad and salad on a plate', 'a plate with a salad and a salad on it', 'a plate of food with a salad and a bowl of salad', 'a plate of food with a fork and a bowl of noodles', 'a plate of food with a salad salad and a salad', 'a plate with a variety of food on it', 'a plate of pasta and vegetables on a table']
epoch: 4, tr_loss: 2.3757876938787, vl_loss: 2.714766049730605
['a bathroom that has a toilet and a window in it', 'the bathroom has a white toilet and sink', 'the kitchen is decorated with blue and white tiles', 'a bathroom with white walls and blue tiles', 'a kitchen with white cabinets a silver stove and a microwave and a cupboard', 'the bathroom is clean and ready for us', 'a kitchen with a white sink and a window on the wall', 'a kitchen with a white sink a sink and a window', 'a bathroom with a white sink and a toilet', 'a kitchen that is next to a sink']
['a group of people in a ski lift posing on a ski slope', 'a man in skis is in the snow', 'a man and a woman posing for the camera', 'a man in a red jacket is skiing down a hill', 'a man and a woman are skiing on a ski slope', 'a man and a woman are in a ski slope', 'a group of people standing in the snow with snowboards', 'a group of people in ski gear standing in the snow', 'a man in a blue jacket is skiing in the snow', 'two children with skis and ski poles']
['a living room with a desk and a computer monitor', 'a room with a desk a laptop a monitor a monitor and a laptop', 'a living room with two laptops and a printer on it', 'a room that has a desk and a computer on it', 'a room with two monitors and a printer on a table', 'a living room with a desk and a computer monitor', 'a room with a laptop and two monitors on it', 'a room that has a desk with a laptop on it', 'a living room with a desk and two computer monitors', 'a living room has a large computer desk and two monitors']
epoch: 5, tr_loss: 2.2467936573599223, vl_loss: 2.710599652975489
['a kitchen has a stove a sink and a refrigerator', 'a kitchen with a stove and a refrigerator in the middle', 'a kitchen with a refrigerator and a stove in the corner', 'a kitchen with a stove a refrigerator and some cabinets', 'a kitchen with a refrigerator and a stove', 'a kitchen with a stove and a refrigerator', 'a kitchen with a stove and refrigerator and a stove', 'a kitchen with a refrigerator and a stove', 'a kitchen with a stove and a refrigerator in it', 'a kitchen with a stove a refrigerator and some cabinets']
['a close up of a sandwich on a plate', 'a large sandwich with a sandwich and some chips', 'two halves of a bagel with a pickle on it', 'a close up of a sandwich on a plate', 'a large sandwich that is sitting on a table', 'a large hamburger and cheese on a plate', 'a close up picture of a sandwich on a bun', 'a close up of a sandwich on a plate', 'a large sandwich with a pickle and a pickle', 'a close up picture of a sandwich on a plate']
['a plate with a bowl of pasta on it', 'a plate of pasta with broccoli and broccoli', 'a plate of pasta with broccoli and broccoli', 'a plate of pasta with broccoli on it', 'a plate of pasta with broccoli and other foods on it', 'a plate of pasta and pasta with broccoli', 'a plate of pasta and pasta on a table', 'the meal has broccoli and other vegetables on the plate', 'a plate with a bunch of food in it', 'a plate with some broccoli and other food on it']
epoch: 6, tr_loss: 2.130743451582733, vl_loss: 2.7172302533380734

In this notebook I walked through the end-to-end process I used to train a caption generation model on Microsoft's COCO dataset. I also productionized the model to run inference on user-supplied photos, exposing it through an API behind a web server. You can see all the code for the project here.
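
The deployed endpoint is essentially a thin wrapper around the preprocessing and inference code above. Here is a minimal sketch of what it could look like with Flask, assuming the trained Captioner and the resize_and_pad / val_transform helpers are importable from a module; the module name and route below are hypothetical, not the exact production code.

## minimal sketch of a captioning endpoint (module name and route are hypothetical)
import numpy as np
import torch
from PIL import Image
from flask import Flask, request, jsonify
from captioner import model, resize_and_pad, val_transform  ## hypothetical module holding the trained model and helpers

app = Flask(__name__)
device = torch.device("cpu")

@app.route("/caption", methods=["POST"])
def caption():
    ## read the uploaded file into an HxWx3 uint8 array and preprocess it like the val set
    pic = np.array(Image.open(request.files["image"].stream).convert("RGB"))
    pic = resize_and_pad(pic, 224, 224)
    im = val_transform(pic).to(device)
    ## generate candidate captions and return the one the language model scores highest
    return jsonify({"caption": model.inference(im, device)})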

Written on June 12, 2020