Kubernetes backup and monitoring

This resource documents the backup/snapshot and monitoring/logging schemes of the UTHPC Kubernetes service.

Backups and snapshots

Snapshots and backups seem similar, but they are distinctly different. Snapshots provide short-term protection against accidental file deletions or changes, script malfunctions, and other problems that come from changing the data itself, by keeping an older version of the data. They do nothing against physical and hardware failures.

Backing up, on the other hand, means taking a full physical copy of the data and keeping it on another, safe system for later recovery. Because of the sheer amount of space backups require, backup systems are usually slower, but built for better availability. Restoring from a backup means copying the data back.

Snapshots

A snapshot is a read-only, point-in-time copy of the filesystem or resource state.

Snapshots | Explanation
Schedule | Every night at 00:00.
Retention | The last seven snapshots by default.
Scope | Snapshots target only PVs created with the standard StorageClass. Images, containers, and anything else aren't snapshotted.
Steps to restore a snapshot | Restoring a snapshot requires manual intervention from a UTHPC administrator. Please raise an issue with support.
Notes | UTHPC can't guarantee exactly simultaneous snapshotting of different PVs inside a namespace. There might be slight time differences (less than 1 s) between them.
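
To see which volumes in a namespace fall inside the scope described above, the claims can be filtered by StorageClass. The following is a minimal sketch, assuming the claim list comes from `kubectl get pvc -n <namespace> -o json`. The function name and the sample data are hypothetical, and claims without an explicit class are reported as not covered, because this page doesn't say whether standard is the cluster default.

```python
import json

COVERED_STORAGE_CLASS = "standard"  # the class named in the Scope row


def split_claims_by_coverage(pvc_list_json):
    """Split PVC names into (covered, not_covered) by their StorageClass.

    Expects the JSON text printed by `kubectl get pvc -o json`.
    """
    covered, not_covered = [], []
    for item in json.loads(pvc_list_json).get("items", []):
        name = item["metadata"]["name"]
        storage_class = item.get("spec", {}).get("storageClassName")
        if storage_class == COVERED_STORAGE_CLASS:
            covered.append(name)
        else:
            # Assumption: a claim with no explicit class is treated as not covered.
            not_covered.append(name)
    return covered, not_covered


# Hypothetical sample data in the shape of a PersistentVolumeClaimList.
sample = json.dumps({
    "items": [
        {"metadata": {"name": "db-data"}, "spec": {"storageClassName": "standard"}},
        {"metadata": {"name": "scratch"}, "spec": {"storageClassName": "fast-local"}},
        {"metadata": {"name": "legacy"}, "spec": {}},
    ]
})

covered, not_covered = split_claims_by_coverage(sample)
print("in scope:", covered)
print("out of scope:", not_covered)
```

Anything listed as out of scope needs its own protection, or a request to UTHPC support.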

Backups

A backup is a physical, read-only copy of the filesystem.

Backups | Explanation
Schedule | Weekly, on Sundays at 00:00.
Retention | The last four weeks by default.
Scope | Backups target only PVs created with the standard StorageClass. Images, containers, and anything else aren't backed up.
Steps to restore from a backup | Restoring from a backup requires manual intervention from a UTHPC administrator. Please raise an issue with support.
Notes | Backups are sequential, which means that backups of different PVs can differ in time.
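
The default schedules and retention periods determine which restore points exist at any given moment. The sketch below lists them for a chosen time, assuming every scheduled job has already finished and that the default retention applies. The function names and the example date are hypothetical, and time zones are ignored.

```python
from datetime import datetime, timedelta


def nightly_snapshots(now, keep=7):
    """Return the last `keep` nightly 00:00 snapshot times at or before `now`, newest first."""
    latest = now.replace(hour=0, minute=0, second=0, microsecond=0)
    return [latest - timedelta(days=i) for i in range(keep)]


def weekly_backups(now, keep=4):
    """Return the last `keep` Sunday 00:00 backup times at or before `now`, newest first."""
    midnight = now.replace(hour=0, minute=0, second=0, microsecond=0)
    # datetime.weekday(): Monday is 0 and Sunday is 6.
    days_since_sunday = (midnight.weekday() + 1) % 7
    latest = midnight - timedelta(days=days_since_sunday)
    return [latest - timedelta(weeks=i) for i in range(keep)]


now = datetime(2024, 5, 15, 9, 30)  # a Wednesday, chosen only for illustration
snapshots = nightly_snapshots(now)
backups = weekly_backups(now)
print("oldest snapshot:", snapshots[-1])
print("oldest backup:", backups[-1])
```

With the defaults, a change older than a week can only be recovered from a weekly backup, so the recoverable state is at most as recent as the preceding Sunday.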

Custom backup/snapshot policy

It's possible to provide users with their own custom backup or snapshot schedule. To request one, please raise an issue with support and include the following information:

  • Reason for needing more frequent snapshots/backups.
  • Namespace or namespaces which require the new schedule.
  • List of users related to the resources.
  • The new backup or snapshot schedule and retention.

Monitoring and logging

Monitoring and log collection are two important parts of a well-functioning system. Monitoring means either white-box or black-box monitoring, for example checking whether systems are up and everything is responding. Collecting logs means that UTHPC takes the logs that applications, systems, services, and machines provide and sends them to the central UTHPC logging service.

Monitoring

Monitoring is the process of checking whether a system or component is alive. Because there are several layers of systems, there need to be several layers of checking.

Monitoring | Explanation
Systems | UTHPC monitors systems and related hardware with its own tools, which catch errors such as unavailable systems, networking problems, or low disk space within at most 5 minutes.
Black-box endpoints | Currently, black-box monitoring only checks manually specified endpoints for availability.
Services | Kubernetes' own LivenessProbe options help check services. Administrators monitor the services that they deploy themselves.
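
A black-box availability check only looks at whether an endpoint answers from the outside. The sketch below illustrates the idea with the Python standard library. It isn't UTHPC's monitoring tooling: the function names, the 5-second timeout, and the rule that any HTTP status below 400 counts as available are all assumptions made for illustration.

```python
import http.client
import urllib.error
import urllib.request


def endpoint_is_available(url, timeout=5.0, opener=urllib.request.urlopen):
    """Return True if `url` answers with an HTTP status below 400 within `timeout` seconds.

    `opener` defaults to urllib's urlopen and can be swapped for a stub in tests.
    """
    try:
        with opener(url, timeout=timeout) as response:
            return response.status < 400
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        # urlopen raises for HTTP error statuses, DNS failures, refused
        # connections, timeouts, and malformed URLs.
        return False


def unavailable_endpoints(urls, timeout=5.0, opener=urllib.request.urlopen):
    """Return the subset of `urls` that failed the availability check, in input order."""
    return [url for url in urls if not endpoint_is_available(url, timeout, opener)]
```

Calling `unavailable_endpoints` with a list of manually specified URLs returns the ones that need attention. A check like this says nothing about why an endpoint is down, which is where the system-level monitoring and the logs come in.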

If you want the UTHPC team to monitor your service or resource, or you'd like to receive your own notifications, raise an issue with UTHPC support. UTHPC has the capabilities to provide that.

Logs

Collecting logs means gathering them from different layers of abstraction. In the case of Kubernetes, there are four main layers:

  • The physical and OS level
    This layer consists of physical server health logs, OS login logs, and audit logs.
  • Kubernetes administrator level
    This layer records actions performed by Kubernetes itself, users, or administrators when using Kubernetes.
  • Service level
    The logs in this layer come from the Kubernetes API and kubelet. They show what deployed resources are doing on the cluster and how they are faring.
  • Application level
    This layer contains the standard output of all containers. The end user decides what data goes into a container's standard output.
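
Because the application level is whatever a container writes to its standard output, the application decides what ends up in the collected logs. The sketch below shows one way an application could write one JSON object per line to standard output. The JSON layout, the class and function names, and the logger name are assumptions made for illustration, not a format that UTHPC requires.

```python
import json
import logging
import sys


class JsonLineFormatter(logging.Formatter):
    """Format each log record as one JSON object per line."""

    def format(self, record):
        return json.dumps({
            "time": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })


def build_stdout_logger(name, stream=None):
    """Return a logger that writes JSON lines to `stream` (standard output by default)."""
    handler = logging.StreamHandler(stream if stream is not None else sys.stdout)
    handler.setFormatter(JsonLineFormatter())
    logger = logging.getLogger(name)
    logger.handlers.clear()   # avoid duplicate lines if called more than once
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep records away from the root logger's handlers
    return logger


log = build_stdout_logger("orders")  # "orders" is a hypothetical application name
log.info("order accepted")
log.debug("request details")  # below INFO, so nothing is written
```

Keeping secrets and personal data out of these messages matters, since everything written to standard output is shipped to the central logging service.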

UTHPC collects all of the logs above and sends them to UTHPC's central logging service, elk.hpc.ut.ee. UTHPC keeps these logs on a best-effort basis, but for at least 3 months.

It's possible to provide Kubernetes users with access to their Application and Service level logs, filtered by namespace name. The other logs exist for auditing and security reasons, and for the UTHPC team to validate and enforce security.
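
Filtering by namespace can be pictured as a search query with a namespace filter and a time range. The sketch below builds such a query body in Elasticsearch query DSL form. This page doesn't describe how access is actually provided, so the assumption that the logging service accepts Elasticsearch-style queries, the field name `kubernetes.namespace`, and the function name are all hypothetical.

```python
import json


def namespace_log_query(namespace, since="now-1h", size=100):
    """Build an Elasticsearch-style search body for recent logs from one namespace."""
    return {
        "size": size,
        "sort": [{"@timestamp": {"order": "desc"}}],  # newest entries first
        "query": {
            "bool": {
                "filter": [
                    # Assumed field name; it depends on how the log shipper is configured.
                    {"term": {"kubernetes.namespace": namespace}},
                    {"range": {"@timestamp": {"gte": since}}},
                ]
            }
        },
    }


# "my-namespace" is a placeholder for a real namespace name.
print(json.dumps(namespace_log_query("my-namespace"), indent=2))
```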