Amazon Glacier is a low-cost, high-capacity cloud storage service best suited for storing cold data - data that is rarely accessed. The price is fair: $4/TB/month. However, unlike Dropbox or Google Drive, it doesn't come with a friendly client where you can simply drag and drop the files to be stored. Instead, you'll have to work with its APIs to upload your files. In this post, I'll explain the basics of how to upload files and how to query the inventory.

Basic APIs

I use the boto3 library in Python. The documentation for Glacier can be found here. I'll use the Client APIs, which simply wrap the underlying HTTP requests. In particular, these are the APIs we'll be using for basic upload and query.

  • upload_archive: upload files.
  • delete_archive: delete files. Note that archives on Glacier are immutable: to update a file, you'll have to delete the old archive and then upload the new one (see the sketch after this list).
  • initiate_job: start a job to download files stored in Glacier or to query the inventory.
  • describe_job: query job status. initiate_job is asynchronous, so we poll with describe_job until the job completes.
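For example, here is a minimal sketch of the delete-then-reupload pattern (the vault name, old_archive_id, and path are placeholders):

import boto3

client = boto3.client('glacier')

# archives are immutable: "updating" a file means deleting the old
# archive by its ID and uploading the new contents as a fresh archive
client.delete_archive(vaultName='myvault', archiveId=old_archive_id)
with open(path, 'rb') as f:
    response = client.upload_archive(vaultName='myvault',
                                     archiveDescription=path,
                                     body=f)
# the new response['archiveId'] replaces old_archive_id in your records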

Upload Files

We'll use the upload_archive API to upload a file. Things to note:

  • You need an access key ID and secret access key to use the API. Follow the boto3 configuration guide to set them up (a minimal credentials file is shown after this list).
  • You must create a "vault" before you can upload. You can do this in the Glacier management console.
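For reference, boto3 reads credentials from ~/.aws/credentials (among other sources); a minimal setup looks like this, with placeholder values:

# ~/.aws/credentials
[default]
aws_access_key_id = YOUR_ACCESS_KEY_ID
aws_secret_access_key = YOUR_SECRET_ACCESS_KEY

# ~/.aws/config
[default]
region = us-east-1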

To upload a file:

import boto3

client = boto3.client('glacier')

# path: local path of the file to upload
with open(path, 'rb') as f:
    response = client.upload_archive(vaultName='myvault',
                    archiveDescription=path,
                    body=f)
    # persist the mapping between response['archiveId'] and path
    # somewhere locally

Note that the archiveDescription argument is optional, but we use it to store the file's local path, which will help with bookkeeping later on. Inside Glacier, a file is identified solely by its archiveId.

It is advisable to keep a local database of the files stored in Glacier, since the inventory is only updated every 24 hours.
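As a minimal sketch of such bookkeeping (the file name and schema here are my own choices, not part of the Glacier API), a small SQLite table will do:

import sqlite3

# tiny local catalog mapping archive IDs to local paths
db = sqlite3.connect('glacier_catalog.db')
db.execute('''CREATE TABLE IF NOT EXISTS archives
              (archive_id TEXT PRIMARY KEY, path TEXT, uploaded_at TEXT)''')

def record_upload(archive_id, path):
    db.execute("INSERT INTO archives VALUES (?, ?, datetime('now'))",
               (archive_id, path))
    db.commit()

# after a successful upload:
# record_upload(response['archiveId'], path)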

Update Inventory

Sometimes the local archive database may get out of sync with Glacier, in which case a forced sync may be necessary: we pull the inventory from Glacier and rebuild the local archive database from it.

Warnings:

  • The Glacier inventory is only updated every 24 hours, so files uploaded within the last 24 hours may not be reflected in it.
  • The inventory query can take up to several hours to finish.

The main API we'll use is initiate_job, together with describe_job to poll the job status and get_job_output to retrieve the results once the job has finished. The same workflow can also be used to download a previously uploaded archive by its archive ID (a sketch is shown at the end of this post); here we'll only show how to query the inventory.

import json
import time

job_req = client.initiate_job(vaultName='myvault',
            jobParameters={'Type': 'inventory-retrieval'})

# poll until the asynchronous job completes; this can take hours
while True:
    status = client.describe_job(vaultName='myvault',
                    jobId=job_req['jobId'])
    if status['Completed']:
        break
    time.sleep(300)

job_resp = client.get_job_output(vaultName='myvault',
            jobId=job_req['jobId'])

# first download the output and then parse the JSON
output = job_resp['body'].read()

archive_list = json.loads(output)['ArchiveList']
# persist archive_list
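To make "persist archive_list" concrete with the SQLite catalog sketched earlier (still an assumed schema, not part of the API): each inventory entry carries ArchiveId, ArchiveDescription (our local path), Size, SHA256TreeHash, and CreationDate.

# rebuild the local catalog from the inventory
db.execute('DELETE FROM archives')
for entry in archive_list:
    db.execute('INSERT INTO archives VALUES (?, ?, ?)',
               (entry['ArchiveId'], entry['ArchiveDescription'],
                entry['CreationDate']))
db.commit()

And for completeness, downloading a previously uploaded archive follows the same initiate/poll/fetch pattern; here is a sketch, reusing the client from above and assuming archive_id was saved at upload time and local_path is where the file should go (both placeholders):

# request retrieval of a specific archive instead of the inventory
job_req = client.initiate_job(vaultName='myvault',
            jobParameters={'Type': 'archive-retrieval',
                           'ArchiveId': archive_id})

# poll with describe_job as before (archive retrievals also take
# hours), then fetch the job output and save the bytes
job_resp = client.get_job_output(vaultName='myvault',
            jobId=job_req['jobId'])
with open(local_path, 'wb') as f:
    f.write(job_resp['body'].read())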