As shown on the Google Cloud Pricing page, preemptible GPUs are significantly cheaper than regular ones. However, if we use them to train models, the training job needs to restart automatically after a preemption.
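A fixed-size managed instance group of size 1 addresses the restart problem: when the VM is preempted, the group recreates it. However, because the VMs are created from an instance template, they are stateless, so model checkpoints need to live somewhere that survives a restart. Also, after the restart, the model training has to start again on its own. My solution: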
- Create a separate persistent disk (PD) to store the model checkpoints.
- Use the startup script to do two things when the instance starts:
  - Use the gcloud CLI to attach the disk to the running VM in read-write mode. This works around the problem that an instance template only supports PDs in read-only mode.
  - Run the model training script. I use docker-compose to make things simpler.
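
The one-time setup behind these steps could look roughly like the sketch below. It is not the exact set of commands I ran: the template name, group name, machine type, GPU type, and disk size are placeholders, and `startup.sh` is assumed to be a local file holding the startup script. Only `disk-1` and `us-central1-a` come from my setup. The `compute-rw` scope is there on the assumption that the VM uses the default service account and needs permission to call `attach-disk` on itself.

```bash
#!/bin/bash
# Sketch only: names, sizes, and machine/GPU types are placeholders,
# except disk-1 and us-central1-a, which match the startup script in this post.
set -eu

ZONE="us-central1-a"
DISK="disk-1"
TEMPLATE="train-template"   # hypothetical template name
GROUP="train-group"         # hypothetical instance group name

# Print each command by default; set DRY_RUN=0 to actually execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

setup_training_group() {
  # 1. Persistent disk that will hold the model checkpoints.
  run gcloud compute disks create "$DISK" --size=200GB --zone="$ZONE"

  # 2. Instance template for a preemptible GPU VM that runs startup.sh on boot.
  #    GPU instances must use the TERMINATE maintenance policy.
  run gcloud compute instance-templates create "$TEMPLATE" \
    --machine-type=n1-standard-8 \
    --accelerator=type=nvidia-tesla-t4,count=1 \
    --preemptible \
    --maintenance-policy=TERMINATE \
    --scopes=compute-rw \
    --metadata-from-file=startup-script=startup.sh

  # 3. Managed instance group of size 1 that recreates the VM after a preemption.
  run gcloud compute instance-groups managed create "$GROUP" \
    --template="$TEMPLATE" --size=1 --zone="$ZONE"
}
```

Calling `setup_training_group` prints the three commands so they can be reviewed first; `DRY_RUN=0 setup_training_group` runs them.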
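For example, here is my startup script:

    #!/bin/bash
    # Attach the checkpoint disk to this VM in read-write mode.
    gcloud compute instances attach-disk "$(hostname)" --disk=disk-1 --mode=rw --zone=us-central1-a
    # Start the training as my own user rather than root.
    su chenxing -c 'sh /home/chenxing/run_docker.sh'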
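
The script above hard-codes the zone and relies on `hostname` matching the instance name. A possible variant, shown here only as a sketch, asks the Compute Engine metadata server for both values instead. The helper names `metadata`, `zone_from_path`, and `attach_checkpoint_disk` are made up for this example; the metadata server returns the zone as a path of the form `projects/<project-number>/zones/<zone>`, so only the last segment is kept.

```bash
#!/bin/bash
# Sketch: look up the VM's own name and zone instead of hard-coding them.
# Helper names are illustrative, not part of any Google tooling.
set -eu

METADATA_URL="http://metadata.google.internal/computeMetadata/v1/instance"

# Fetch one value (e.g. "name" or "zone") from the metadata server.
metadata() {
  curl -sf -H "Metadata-Flavor: Google" "$METADATA_URL/$1"
}

# Reduce "projects/<project-number>/zones/<zone>" to just "<zone>".
zone_from_path() {
  echo "${1##*/}"
}

# Attach the given disk to this VM in read-write mode.
attach_checkpoint_disk() {
  name="$(metadata name)"
  zone="$(zone_from_path "$(metadata zone)")"
  gcloud compute instances attach-disk "$name" --disk="$1" --mode=rw --zone="$zone"
}
```

In the startup script, `attach_checkpoint_disk disk-1` would then replace the `attach-disk` line. The disk still has to live in the same zone as the instance group.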