Kubernetes backup and monitoring¶

This resource documents the backup/snapshot and monitoring/logging schemes of the UTHPC Kubernetes service.

Backups and snapshots¶

Although snapshots and backups seem similar, they are distinctly different. A snapshot keeps an instance of an older version of the data, which provides short-term protection against accidental file deletions or changes, script malfunctions, and other problems that come from changing the data itself. Snapshots do nothing for physical and hardware problems.

Backing up, on the other hand, means taking a full physical copy of the data, moving it to another, safe system, and keeping it there for later recovery. Because of the sheer amount of space that backups require, backup systems are usually slower but built for better availability. Restoring from a backup means copying the data back.

Snapshots¶

A snapshot is a read-only, point-in-time copy of the filesystem or resource state.

Snapshots	Explanation
Schedule	Snapshots every night at 00:00.
Retention	By default the last seven snapshots.
Scope	Snapshots target only PVs created with the `standard` StorageClass. Images, containers, and anything else aren't snapshotted.
Steps to restore a snapshot	Restoring a snapshot requires manual intervention from a UTHPC administrator. Please raise an issue with support.
Notes	UTHPC can't guarantee absolutely simultaneous snapshotting of different PVs inside a namespace. There might be slight time differences between them (less than 1 s).
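
To make the schedule and retention rows concrete, the following minimal sketch lists which nightly snapshots would still exist at a given moment. It assumes one snapshot at 00:00 every night and the default retention of seven; the function name `retained_snapshots` and the use of naive local datetimes are illustrative assumptions, not a description of how the UTHPC storage backend works.

```python
from datetime import datetime, timedelta


def retained_snapshots(now: datetime, keep: int = 7) -> list[datetime]:
    """Return the timestamps of the `keep` most recent nightly snapshots.

    Hypothetical model: one snapshot is taken at 00:00 every night.
    """
    # The most recent snapshot is the midnight at or before `now`.
    latest = now.replace(hour=0, minute=0, second=0, microsecond=0)
    # Walk backwards one day at a time, newest first.
    return [latest - timedelta(days=i) for i in range(keep)]


if __name__ == "__main__":
    for ts in retained_snapshots(datetime(2024, 5, 15, 13, 30)):
        print(ts.isoformat())
```

Under these assumptions, at 13:30 on 15 May 2024 the oldest state that a snapshot can still restore is the one from midnight on 9 May 2024.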

Backups¶

A backup is a physical, read-only copy of the filesystem.

Backups	Explanation
Schedule	Weekly backups, Sundays at 00:00.
Retention	The last four weeks by default.
Scope	Backups target only PVs created with the `standard` StorageClass. Images, containers, and anything else aren't backed up.
Steps to restore from a backup	Restoring from a backup requires manual intervention from a UTHPC administrator. Please raise an issue with support.
Notes	Backups are sequential, which means that backups of different PVs can differ in time.
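
As an illustration of the weekly schedule and four-week retention, the sketch below computes the start times of the backups that would still be kept at a given moment. It assumes a backup run starts every Sunday at 00:00 and that exactly the last four runs are retained; the function name `retained_backups` is hypothetical.

```python
from datetime import datetime, timedelta


def retained_backups(now: datetime, weeks: int = 4) -> list[datetime]:
    """Return the start times of the `weeks` most recent Sunday 00:00 backups."""
    midnight = now.replace(hour=0, minute=0, second=0, microsecond=0)
    # datetime.weekday() returns 0 for Monday and 6 for Sunday.
    days_since_sunday = (midnight.weekday() + 1) % 7
    latest = midnight - timedelta(days=days_since_sunday)
    # Walk backwards one week at a time, newest first.
    return [latest - timedelta(weeks=i) for i in range(weeks)]


if __name__ == "__main__":
    for ts in retained_backups(datetime(2024, 5, 15, 13, 30)):
        print(ts.date().isoformat())
```

Because backups are sequential, each timestamp marks the start of a backup run, not the exact moment at which every PV was copied.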

Restoring a snapshot or backup¶

Volume-level snapshots and backups need to be restored by UTHPC infrastructure team administrators, because persistent volumes (PVs) are not namespaced objects in Kubernetes. If you need to restore a volume, please contact UTHPC support with the PVC name and namespace.
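
Before contacting support, it can help to check which of your PVCs are in scope at all. The following sketch filters the JSON produced by `kubectl get pvc -n <namespace> -o json` for claims that use the `standard` StorageClass and returns the namespace and name pairs to include in a support request. The sample data, including the names `my-namespace`, `db-data`, `cache` and the `fast` StorageClass, is hypothetical.

```python
def covered_pvcs(pvc_list: dict, storage_class: str = "standard") -> list[tuple[str, str]]:
    """Return (namespace, name) pairs of the PVCs that use `storage_class`."""
    covered = []
    for item in pvc_list.get("items", []):
        # spec.storageClassName holds the StorageClass requested by the claim.
        if item.get("spec", {}).get("storageClassName") == storage_class:
            meta = item["metadata"]
            covered.append((meta["namespace"], meta["name"]))
    return covered


# Hypothetical sample, shaped like the output of `kubectl get pvc -o json`.
SAMPLE = {
    "items": [
        {
            "metadata": {"namespace": "my-namespace", "name": "db-data"},
            "spec": {"storageClassName": "standard"},
        },
        {
            "metadata": {"namespace": "my-namespace", "name": "cache"},
            "spec": {"storageClassName": "fast"},
        },
    ]
}

if __name__ == "__main__":
    for namespace, name in covered_pvcs(SAMPLE):
        print(f"namespace={namespace} pvc={name}")
```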

Custom backup/snapshot policy¶

It's possible to provide users with their own custom backup or snapshot schedule. To request one, please raise an issue with support and include the following information:

Reason for needing more frequent snapshots/backups.
Namespace or namespaces which require the new schedule.
List of users related to the resources.
The new backup or snapshot schedule and retention.

Monitoring and logging¶

Monitoring and log collection are two important parts of a well-functioning system. Monitoring means either white-box or black-box monitoring, for example checking that systems are up and that everything is answering. Collecting logs means that UTHPC takes the logs that applications, systems, services, and machines provide and sends them to the central UTHPC logging service.

Monitoring¶

Monitoring is the process of checking whether a system or component is alive. Because there are several layers of systems, there need to be several layers of checking.

Monitoring	Explanation
Systems	UTHPC monitors systems and related hardware with its own tools, which catch errors such as unavailable systems, networking problems, or low disk space within at most 5 minutes.
Black-box endpoints	Currently, black-box monitoring only checks manually specified endpoints for availability.
Services	Kubernetes' own liveness probe (`livenessProbe`) options help with checking services. Administrators monitor the services that administrators themselves have deployed.
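
A black-box availability check of the kind described in the table can be sketched as follows. This is a minimal illustration using only the Python standard library, not the tooling that UTHPC runs; the function name `endpoint_is_available`, the 5-second timeout, and the example URL are assumptions.

```python
import http.client
import urllib.error
import urllib.request


def endpoint_is_available(url: str, timeout: float = 5.0) -> bool:
    """Return True if `url` answers an HTTP GET with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Connection refused, DNS failure, timeout, malformed response,
        # or an HTTP error status (urlopen raises HTTPError for 4xx/5xx).
        return False


if __name__ == "__main__":
    # Hypothetical endpoint; replace it with the address of your own service.
    url = "http://localhost:8080/healthz"
    print(url, "is up" if endpoint_is_available(url) else "is DOWN")
```

The check only observes the service from the outside, which is what makes it black-box: it says whether the endpoint answers, not why it might have stopped answering.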

If you want the UTHPC team to monitor your service or resource, or you'd like to receive your own notifications, raise an issue with UTHPC support. UTHPC has the capability to provide both.

Logs¶

Collecting logs means gathering them from different layers of abstraction. In the case of Kubernetes, there are three main layers:

The physical and OS level

This layer consists of physical server health logs, OS login logs, and auditing logs.

Service level

Logs at this level come from the Kubernetes API server and kubelet components and show all activity by users, deployed resources, service accounts, and the control plane itself.

Application level

This layer contains the standard output of all containers. The end user decides what data gets into a container's standard output.

UTHPC collects all of the logs described above and sends them to UTHPC's central logging service, elk.hpc.ut.ee. UTHPC keeps these logs on a best-effort basis, but for at least 3 months.
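
Because application-level logs are whatever the container writes to its standard output, the application itself controls what ends up in the central logging service. The sketch below shows one way to write structured, one-line-per-event logs to standard output from a Python application. The JSON field names and the logger name `my-app` are illustrative assumptions; whether the UTHPC pipeline parses JSON fields is not stated in this document.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Format each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps(
            {
                "time": self.formatTime(record),
                "level": record.levelname,
                "logger": record.name,
                "message": record.getMessage(),
            }
        )


def build_logger(name: str, stream=sys.stdout) -> logging.Logger:
    """Return a logger that writes JSON lines to `stream` (stdout by default)."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    # Replace existing handlers so repeated calls don't duplicate output.
    logger.handlers = [handler]
    logger.propagate = False
    return logger


if __name__ == "__main__":
    log = build_logger("my-app")  # hypothetical application name
    log.info("order %s processed", 42)
```

Since everything written to standard output is collected and retained, avoid logging secrets or personal data that you would not want stored for months.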

It's possible to provide Kubernetes users with access to their application-level and service-level logs, filtered by namespace name. The other logs exist for auditing and security reasons and allow the UTHPC team to validate and enforce security.
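
The access rule above can be pictured as a simple filter over log records. In this sketch the record fields `namespace` and `layer`, the layer names, and the sample records are all hypothetical; the actual field names in the UTHPC logging service may differ.

```python
# Layers that users can be given access to, per the rule above (names are hypothetical).
USER_VISIBLE_LAYERS = {"application", "service"}


def visible_logs(records: list[dict], namespace: str) -> list[dict]:
    """Return the records a user of `namespace` could be given access to."""
    return [
        record
        for record in records
        if record.get("namespace") == namespace
        and record.get("layer") in USER_VISIBLE_LAYERS
    ]


# Hypothetical sample records.
RECORDS = [
    {"layer": "application", "namespace": "team-a", "message": "request served"},
    {"layer": "service", "namespace": "team-a", "message": "pod created"},
    {"layer": "application", "namespace": "team-b", "message": "job finished"},
    {"layer": "os", "namespace": None, "message": "ssh login"},
]

if __name__ == "__main__":
    for record in visible_logs(RECORDS, "team-a"):
        print(record["layer"], record["message"])
```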