Cloud backup and monitoring¶

Below are the backup/snapshot and monitoring/logging schemes of the UTHPC Cloud/OpenStack service.

Backups and snapshots¶

Snapshots and backups seem similar, but they are distinctly different. Snapshots provide short-term protection against accidental file deletions or changes, script malfunctions, and other problems that come from changing the data itself, by keeping a copy of an older version. They do nothing against physical and hardware problems.

Backing up, on the other hand, means taking a full physical copy of the data, moving it to another, safe system, and keeping it there for later recovery. Because backups need a large amount of space, backup systems are usually slower, but built for better availability. Restoring from a backup means copying the data back from the backup system.

In the cloud, there are two levels of backup: backing up all the disks of a machine, or copying data out from inside the machine itself. Because UTHPC doesn't have access to the inside of virtual machines, UTHPC does backups only by backing up the disks themselves. Anything that's not yet written to disk, for example RAM content, isn't recoverable. Changes made between two snapshots or backups aren't recoverable either: a file created and deleted between two snapshots is permanently lost. Over time, a disk snapshot can consume as much storage as the original disk, because UTHPC needs to keep all changes to guarantee the point-in-time disk content.
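The point about changes between snapshots can be illustrated with a toy model. The sketch below doesn't reflect how UTHPC storage works internally. It assumes a disk can be represented as a Python dictionary of file names and contents, and `take_snapshot` is a hypothetical helper that copies that dictionary.

```python
import copy


def take_snapshot(disk: dict[str, str]) -> dict[str, str]:
    """Return a point-in-time copy of the toy disk (hypothetical helper)."""
    return copy.deepcopy(disk)


disk = {"report.txt": "v1"}
snapshots = [take_snapshot(disk)]           # first snapshot

disk["scratch.tmp"] = "temporary results"   # created after the first snapshot
del disk["scratch.tmp"]                     # deleted before the second snapshot
disk["report.txt"] = "v2"

snapshots.append(take_snapshot(disk))       # second snapshot

# The short-lived file is in neither snapshot, so it can't be restored.
recoverable = any("scratch.tmp" in snap for snap in snapshots)
print(recoverable)                 # False

# A file that existed at snapshot time keeps its older version.
print(snapshots[0]["report.txt"])  # v1
```

Data that must survive should therefore exist on disk at the moment a snapshot is taken.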

Snapshots¶

A snapshot is a read-only, point-in-time copy of the whole disk.

Snapshots	Storage Type	Explanation
Schedule	Scratch	The user is responsible for making snapshots.
Schedule	Prod	Global snapshots are usually made every week. The user can make snapshots at any time.
Retention	Scratch	The user determines the retention schedule.
Retention	Prod	The last three snapshots are kept by default. The user defines the retention schedule for user snapshots.
Scope	Scratch	The user schedules snapshots and restores from them. A snapshot covers all VM disks together at the same point in time.
Scope	Prod	The whole system is snapshotted at once.
Steps to restore from snapshot	Any	Restoring a snapshot requires user intervention.
Notes	Prod	These global snapshots are mainly for UTHPC's own problem coverage, and while UTHPC helps out if there are any problems, this isn't a guaranteed service. UTHPC does snapshots on a best-effort basis.
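For user-managed snapshots, a "keep the last N" retention rule like the Prod default above can be sketched as follows. This is only an illustration, not UTHPC's tooling: `select_expired` is a hypothetical helper, and the snapshot dates are invented for the example.

```python
from datetime import datetime


def select_expired(snapshot_times: list[datetime], keep: int = 3) -> list[datetime]:
    """Return the snapshots that fall outside the newest `keep` entries."""
    if keep < 0:
        raise ValueError("keep must be non-negative")
    newest_first = sorted(snapshot_times, reverse=True)
    return newest_first[keep:]


# Five weekly snapshots (example dates only).
weekly = [datetime(2024, 1, day) for day in (1, 8, 15, 22, 29)]
expired = select_expired(weekly, keep=3)
print([t.date().isoformat() for t in expired])  # ['2024-01-08', '2024-01-01']
```

The function only selects candidates for deletion. Actually deleting them is left to whatever tool manages the snapshots.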

Backups¶

A backup is a physical, read-only copy of the whole disk under a filesystem.

Backups	Storage Type	Explanation
Schedule	Scratch	The backup of a user-created snapshot starts shortly after the snapshot is created. Disks without snapshots aren't backed up. The transfer takes time, during which you can't delete the snapshot.
Schedule	Prod	Weekly backups. User-created snapshots have first priority. Because the transfer takes time, a backup can complete at any time during the week.
Retention	Scratch	The last two snapshot backups by default.
Retention	Prod	The last two snapshot and whole-disk backups by default.
Scope	Scratch	The snapshot and its base image are copied linearly to tape storage. A backup and restore cycle involves copying the entire storage device.
Scope	Prod	The whole filesystem is snapshotted at once and backed up from that snapshot. Only whole disks are backed up at a time.
Steps to restore a backup	Any	Restoring a backup requires manual intervention from a UTHPC administrator. Please raise an issue with support@hpc.ut.ee.
Notes	Any	These backups are mainly for UTHPC's own problem coverage, and while UTHPC helps out if there are any problems, this isn't a guaranteed service. UTHPC does the backups on a best-effort basis, and backups aren't guaranteed to be perfect every week.
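The rule that user-created snapshots have first priority can be pictured as a simple ordering of backup jobs. The sketch below isn't UTHPC's scheduler. `BackupJob`, `backup_order`, the disk names, and the "oldest snapshot first" tie-break are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class BackupJob:
    disk: str                 # hypothetical disk name
    snapshot_taken: datetime  # when the snapshot was created
    user_created: bool        # True for user snapshots, False for global ones


def backup_order(jobs: list[BackupJob]) -> list[BackupJob]:
    """User-created snapshots first, then oldest snapshot first (assumed tie-break)."""
    return sorted(jobs, key=lambda job: (not job.user_created, job.snapshot_taken))


jobs = [
    BackupJob("vm-a-disk", datetime(2024, 1, 1, 2, 0), user_created=False),
    BackupJob("vm-b-disk", datetime(2024, 1, 3, 9, 30), user_created=True),
    BackupJob("vm-c-disk", datetime(2024, 1, 1, 2, 5), user_created=False),
]
print([job.disk for job in backup_order(jobs)])
# ['vm-b-disk', 'vm-a-disk', 'vm-c-disk']
```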

Custom backup/snapshot policy¶

It's possible to provide users with their own custom backup schedule. To do this, please raise an issue with support, together with the following information:

Needed information

Reason for needing more frequent backups.
Projects and machines which require the new schedule.
List of users related to the resources.
The new backup schedule and retention.
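To make sure a request contains everything listed above, it can help to fill in the fields in a structured way first. The sketch below is one possible way to do that. `BackupPolicyRequest` and all example values are hypothetical, and UTHPC support doesn't require any particular format.

```python
from dataclasses import dataclass


@dataclass
class BackupPolicyRequest:
    reason: str
    projects_and_machines: list[str]
    users: list[str]
    schedule: str
    retention: str

    def to_message(self) -> str:
        """Render the request as plain text, refusing incomplete requests."""
        missing = [name for name, value in vars(self).items() if not value]
        if missing:
            raise ValueError(f"missing information: {', '.join(missing)}")
        return "\n".join([
            f"Reason: {self.reason}",
            f"Projects and machines: {', '.join(self.projects_and_machines)}",
            f"Related users: {', '.join(self.users)}",
            f"Requested schedule: {self.schedule}",
            f"Requested retention: {self.retention}",
        ])


# Example values only; replace them with your own project details.
request = BackupPolicyRequest(
    reason="Database changes daily; weekly backups lose too much work.",
    projects_and_machines=["example-project/db-server-1"],
    users=["example.user"],
    schedule="daily",
    retention="last 7 backups",
)
print(request.to_message())
```

The resulting text can be pasted into the issue raised with support.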

Monitoring and logging¶

Monitoring and log collection are two important parts of a well-functioning system. Monitoring means either white-box or black-box monitoring, for example checking that systems are up and that everything responds. Collecting logs means that the UTHPC team takes the logs that applications, systems, services, and machines provide and ships them to one place for parsing and searching.

Monitoring¶

Monitoring is the process of checking whether a system or component is alive. Because there are several layers of systems, there need to be several layers of checking.

Monitoring	Explanation
Systems	The UTHPC department monitors systems and related hardware with its own tools, which catch errors such as unavailable systems, networking problems, disk space problems, and other failures within at most 5 minutes.
Black-box endpoints	Currently, black-box monitoring only checks manually specified endpoints for availability.
Services	Services are only checked when manually specified by UTHPC admins.

If you want the UTHPC team to monitor your services and resources, or you'd like to receive your own notifications, please send your request to UTHPC support (support@hpc.ut.ee). UTHPC has the capability to provide this.
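For your own quick checks, a black-box availability probe can be as small as the sketch below. It isn't the tool UTHPC uses. `endpoint_is_up` is a hypothetical helper built on Python's standard `urllib`, and it treats any HTTP error status, timeout, or connection failure as "down".

```python
import urllib.error
import urllib.request


def endpoint_is_up(url: str, timeout: float = 5.0) -> bool:
    """Return True if the endpoint answers the request without an HTTP error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            # urlopen raises HTTPError for 4xx/5xx, so this is a final safeguard.
            return response.status < 400
    except (urllib.error.URLError, OSError, ValueError):
        # Covers HTTP errors, refused connections, timeouts, and malformed URLs.
        return False
```

Calling `endpoint_is_up("https://example.org/health")`, where the URL is a placeholder for your own service endpoint, returns `True` or `False` and can be run from a cron job or a similar scheduler.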

Logs¶

Collecting logs means gathering them from different layers of abstraction. In the case of Cloud/OpenStack there are 3 main layers:

The physical and OS level
This layer consists of physical server health logs, OS login logs, and audit logs.

OpenStack administrator level
This layer records actions performed by OpenStack itself, by administrators, or by users when using OpenStack.

Service level
Logs in this layer come from the OpenStack API and software components, showing what resources are doing on the cluster and how they are performing.

UTHPC collects the logs described above and sends them to UTHPC's central logging service at https://elk.hpc.ut.ee. UTHPC keeps these logs on a best-effort basis, but for at least 3 months.
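When investigating an older incident, it's useful to know whether the relevant logs still fall inside the guaranteed retention period. The sketch below is only an illustration. It assumes "3 months" means 90 days, and `within_guaranteed_retention` is a hypothetical helper, not part of the logging service.

```python
from datetime import datetime, timedelta, timezone

# Assumption: "3 months" is approximated as 90 days.
GUARANTEED_RETENTION = timedelta(days=90)


def within_guaranteed_retention(event_time: datetime, now: datetime) -> bool:
    """True if a log entry from `event_time` is no older than the guaranteed window."""
    return now - event_time <= GUARANTEED_RETENTION


now = datetime(2024, 6, 1, tzinfo=timezone.utc)
print(within_guaranteed_retention(datetime(2024, 4, 15, tzinfo=timezone.utc), now))  # True
print(within_guaranteed_retention(datetime(2024, 1, 10, tzinfo=timezone.utc), now))  # False
```

Older logs may still exist, because retention beyond the guaranteed period is best-effort.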