
Cloud backup and monitoring

Below are the backup/snapshot and monitoring/logging schemes of the UTHPC Cloud/OpenStack service.

Backups and snapshots

Snapshots and backups seem similar but are distinctly different. Snapshots provide short-term protection against accidental file deletions or changes, script malfunctions, and other problems that come from changing the data itself, by keeping an older version of the data. They do nothing for physical and hardware problems.

Backing up, on the other hand, means taking a full physical copy of the data and moving it to another, safe system, where it's kept for later recovery. Because backups require a large amount of space, backup systems are usually slower but built for better availability. Restoring from a backup means copying the data back from the backup system.

In the cloud, there are two levels of backup: backing up all the disks of a machine, or copying data out from inside the machine itself. Because UTHPC doesn't have access to the inside of virtual machines, UTHPC only backs up the disks themselves. Anything not yet written to disk, for example RAM content, isn't recoverable. Changes made between two snapshots or backups aren't recoverable either: a file that's created and deleted between two snapshots is permanently lost. Over time, a disk snapshot comes to consume the same amount of storage as the original disk, because UTHPC needs to keep all changes to guarantee point-in-time disk content.
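The point-in-time behavior described above can be illustrated with a small simulation. This is only a sketch: the event timeline, the file names, and the functions `files_at` and `recoverable_files` are hypothetical and model snapshot semantics in general, not UTHPC's actual tooling.

```python
def files_at(events, t):
    """Return the set of files that exist on disk at time t."""
    files = set()
    for time, action, name in sorted(events):
        if time > t:
            break
        if action == "create":
            files.add(name)
        elif action == "delete":
            files.discard(name)
    return files


def recoverable_files(events, snapshot_times):
    """Return every file captured by at least one snapshot."""
    captured = set()
    for t in snapshot_times:
        captured |= files_at(events, t)
    return captured


# Hypothetical timeline: (time, action, file name).
events = [
    (1, "create", "report.txt"),
    (3, "create", "tmp.dat"),
    (5, "delete", "tmp.dat"),
    (9, "delete", "report.txt"),
]

# Snapshots taken at times 2 and 7.
print(recoverable_files(events, snapshot_times=[2, 7]))  # {'report.txt'}
```

Here `tmp.dat` is created and deleted between the two snapshots, so no snapshot contains it. `report.txt` stays recoverable even after its deletion at time 9, because both snapshots captured it.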

Snapshots

A snapshot is a read-only, point-in-time copy of the whole disk.

Snapshots | Storage Type | Explanation
Schedule | Scratch | The user is responsible for making snapshots.
Schedule | Prod | Global snapshots are usually made every week. Users can make snapshots at any time.
Retention | Scratch | The user determines the retention schedule.
Retention | Prod | The last three snapshots by default. The user defines the schedule for user snapshots.
Scope | Scratch | The user schedules snapshots and restores from them. Snapshots cover all VM disks together at the same point in time.
Scope | Prod | The whole system is snapshotted at once.
Steps to restore from snapshot | Any | Restoring a snapshot requires user intervention.
Notes | Prod | These global snapshots are mainly for UTHPC's own problem coverage, and while UTHPC helps out if there are any problems, this isn't a guaranteed service. UTHPC does snapshots on a best-effort basis.
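A "keep the last N" retention rule, such as the default of three snapshots on Prod, can be sketched as follows. The function `apply_retention` and the snapshot names and dates are hypothetical illustrations, not the mechanism UTHPC actually uses.

```python
from datetime import datetime


def apply_retention(snapshots, keep=3):
    """Split snapshots into (kept, expired), keeping the `keep` newest."""
    newest_first = sorted(snapshots, key=lambda s: s[1], reverse=True)
    return newest_first[:keep], newest_first[keep:]


# Hypothetical weekly snapshots: (name, creation time).
snapshots = [
    ("snap-week-1", datetime(2024, 1, 1)),
    ("snap-week-2", datetime(2024, 1, 8)),
    ("snap-week-3", datetime(2024, 1, 15)),
    ("snap-week-4", datetime(2024, 1, 22)),
]

kept, expired = apply_retention(snapshots, keep=3)
print([name for name, _ in kept])     # ['snap-week-4', 'snap-week-3', 'snap-week-2']
print([name for name, _ in expired])  # ['snap-week-1']
```

With weekly snapshots and `keep=3`, the oldest recoverable state is roughly three weeks old. Anything that changed before that point can no longer be restored from a snapshot.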

Backups

A backup is a physical, read-only copy of the whole disk under a filesystem.

Backups | Storage Type | Explanation
Schedule | Scratch | A user-created snapshot is backed up shortly after the snapshot is created. Disks without snapshots aren't backed up. Transferring takes time, during which you can't delete the snapshot.
Schedule | Prod | Weekly backups. User-created snapshots have first priority. Because transferring takes time, a backup can complete at any time during the week.
Retention | Scratch | The last two snapshot backups by default.
Retention | Prod | The last two snapshot and whole disk backups by default.
Scope | Scratch | The snapshot and its base image are copied linearly to tape storage. A backup and restore cycle involves copying the entire storage device.
Scope | Prod | The whole filesystem is snapshotted at once and backed up from that snapshot. Only whole disks at a time.
Steps to restore a backup | Any | Restoring a backup requires manual intervention from a UTHPC administrator. Please raise an issue with support@hpc.ut.ee.
Notes | Any | These backups are mainly for UTHPC's own problem coverage, and while UTHPC helps out if there are any problems, this isn't a guaranteed service. UTHPC does the backups on a best-effort basis, and backups aren't guaranteed to be perfect every week.

Custom backup/snapshot policy

It's possible to provide users with their own custom backup schedule. To do this, please raise an issue with support, together with the following information:

Needed information

  • Reason for needing more frequent backups.
  • Projects and machines which require the new schedule.
  • List of users related to the resources.
  • The new backup schedule and retention.
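One way to make sure a request contains all four items is to fill in a small template before writing to support. The sketch below is purely illustrative: the class `BackupPolicyRequest`, its field names, and the example values are hypothetical, and UTHPC doesn't prescribe any particular format.

```python
from dataclasses import dataclass


@dataclass
class BackupPolicyRequest:
    reason: str
    projects_and_machines: list[str]
    users: list[str]
    schedule_and_retention: str

    def to_text(self) -> str:
        """Render the request as plain text for a support ticket."""
        return "\n".join([
            "Custom backup/snapshot policy request",
            f"Reason: {self.reason}",
            f"Projects and machines: {', '.join(self.projects_and_machines)}",
            f"Users: {', '.join(self.users)}",
            f"Schedule and retention: {self.schedule_and_retention}",
        ])


# Hypothetical example values.
request = BackupPolicyRequest(
    reason="Database changes daily and must be recoverable per day",
    projects_and_machines=["example-project/db-vm-1"],
    users=["example.user"],
    schedule_and_retention="Daily backups, keep the last 7",
)
print(request.to_text())
```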

Monitoring and logging

Monitoring and log collection are two important parts of a well-functioning system. Monitoring means either white-box or black-box monitoring, for example checking that systems are up and that everything is answering. Collecting logs means that the UTHPC team takes the logs that applications, systems, services, and machines provide and ships them to one place for parsing and searching.

Monitoring

Monitoring is the process of checking whether a system or component is alive. Because there are several layers of systems, there need to be several layers of checking.

Monitoring | Explanation
Systems | The UTHPC department monitors systems and related hardware with its own tools, which catch errors such as unavailable systems, networking problems, low disk space, or anything else within at most 5 minutes.
Black-box endpoints | Currently, black-box monitoring only checks manually specified endpoints for availability.
Services | Services are only checked when manually specified by UTHPC admins.

If you want the UTHPC team to monitor your services and resources, or you'd like to receive your own notifications, please send your request to UTHPC support (support@hpc.ut.ee). UTHPC has the capability to provide both.
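For readers unfamiliar with black-box monitoring, the idea of checking an endpoint for availability from the outside can be sketched with Python's standard library. This is a minimal illustration, not UTHPC's monitoring tooling: the function names, the 5-second timeout, and the "status below 400" rule are assumptions made for the example.

```python
import urllib.error
import urllib.request


def check_endpoint(url, timeout=5.0, opener=urllib.request.urlopen):
    """Return True if `url` answers with an HTTP status below 400."""
    try:
        with opener(url, timeout=timeout) as response:
            return response.status < 400
    except (urllib.error.URLError, OSError, ValueError):
        # Connection errors, timeouts, HTTP 4xx/5xx responses and
        # malformed URLs all count as "unavailable" from the outside.
        return False


def check_all(urls, **kwargs):
    """Map each endpoint URL to its availability."""
    return {url: check_endpoint(url, **kwargs) for url in urls}
```

Calling `check_all` with a list of endpoint URLs returns a dictionary from each URL to `True` or `False`. The `opener` parameter exists only so the check can be exercised without network access. A black-box check like this says nothing about why an endpoint is down, which is why it complements rather than replaces system-level monitoring.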

Logs

Collecting logs means gathering them from different layers of abstraction. In the case of Cloud/OpenStack there are 3 main layers:

  1. The physical and OS level
    This layer consists of physical server health logs, OS login logs, and auditing logs.
  2. OpenStack administrator level
    This layer outputs information about actions done by OpenStack, administrators or users when using OpenStack.
  3. Service level
    Logs at this layer come from the OpenStack API and software components, showing what resources are doing on the cluster and how they are performing.

UTHPC collects the logs described above and sends them to UTHPC's central logging service, https://elk.hpc.ut.ee . UTHPC keeps these logs on a best-effort basis, but for at least 3 months.