Use Preemptible GPUs on Google Cloud with Automatic Restart

As shown in Google Cloud Pricing page , preemptible GPUs are significantly cheaper. However, if we use them for training models, we need it to be able to restart automatically after an preemption.

A fixed size instance group of size 1 can address the restart problem. However, because the VMs are from templates, they become stateless. Also, after the restart, we need the model training to start.

My solution:

Create another Persistent Disk to store the model checkpoints
Use the startup script to do 2 things upon instance start:
- Use the gcloud CLI to attach the disk to the running VM in rw mode. This solves the problem that instance template only supports PDs in Read-Only mode.
- Run the model training script. I use docker-compose to make things simpler.

For example, here is my startup script:

#! /bin/bash
gcloud compute instances attach-disk $(hostname) --disk=disk-1 --mode=rw --zone=us-central1-a
su chenxing -c 'sh /home/chenxing/run_docker.sh'

一	二	三	四	五	六	日
« 5月				10月 »
					1	2
3	4	5	6	7	8	9
10	11	12	13	14	15	16
17	18	19	20	21	22	23
24	25	26	27	28	29	30