Files, tape-migration and archival
This document gives an overview of how files are stored on the UTHPC filesystem and how tape migration works. It is highly recommended to read through the entire document if you plan to tape-migrate or archive files on UTHPC systems.
1. Resident (hot) files on /gpfs/ filesystem
In the context of filesystems, the most accurate way to describe a regular file on a filesystem is that it is resident. It means that the entirety of the data that comprises the file is situated on the disks that make up the filesystem.
In the context of access speed, it behaves like an HDD - a request for data is done in microseconds and the system replies with the contents of the requested file. In the context of accounting and storage quotas, this also means that all the data in a file is accounted for. This is the most basic form of file storage and the most common in UTHPC.
2. Tape
"Tape" can mean several things. The following terms are the ones mainly used:
- Tape library: The physical system that holds the tape cartridges and data. Commonly referred to as TS4500 or TSM.
- Tape backup: A backup copy of the file, independent of the original.
- Tape migrated or space managed files: Resident files whose data is partially on tape.
- Tape migration: The process of migrating files to tape.
- (Tape) recall: Usually, the process of recalling files from tape, in the context of space management.
- Tape restore: The process of restoring data from a tape backup.
- (Tape) archive: An archival copy of a file, usually on tape, independent of the resident file.
It is important to note that of the 3 types of files on tape, 2 (backup and archive) are independent of the resident file. Migrated files are the exception.
2.1 Tape backup file/copy
In most UTHPC systems, backup copies of data are done to tape. This is a point-in-time full copy of the resident data on a system. This point-in-time copy is immutable and cannot easily be overwritten. It can however be used to restore lost or compromised files.
Tape backup copies also expire. This is completely normal in any information lifecycle. A regular backup-copy lifecycle is as follows:
- file is created on disk;
- a scheduled backup runs and checks whether a backup copy exists;
- backup copy is made if missing.
Now what happens if a file is modified? A new copy of the file is stored on tape. This creates a new backup version of the original file, and this style of backup is often called incremental backup. Multiple point-in-time copies of a file may exist, depending on how often the file is changed and where the changes fall within the backup schedule window.
In UTHPC, at most 3 incremental copies are held. This means there is an active version of the backup copy and 2 inactive copies. Once a fourth incremental copy is put to tape, the oldest one is flagged for expiry by the policy rules defined in the backup system. Once expiry starts, an inactive copy is kept for 30 more days before that version is fully deleted. This is the first type of file expiration on tape.
The second type of expiration is file deletion. Once the resident file is deleted from the source system and a new backup is scheduled, the file is flagged for expiry during evaluation. The expiry period in this case is usually 60 days.
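The two expiration rules above can be illustrated with a small simulation. This is only a sketch under stated assumptions: the class and method names are hypothetical, the numbers (3 versions, 30 days, 60 days) are taken from the text, and the real backup system's policy engine is not modeled.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta
from typing import List, Optional

MAX_VERSIONS = 3          # one active + two inactive copies (from the text)
INACTIVE_GRACE_DAYS = 30  # grace period for a version pushed out by a newer one
DELETED_GRACE_DAYS = 60   # usual grace period after the resident file is deleted


@dataclass
class BackupVersion:
    created: date
    expires: Optional[date] = None  # None means "not flagged for expiry"


@dataclass
class BackupHistory:
    # Hypothetical model of one file's backup versions, oldest first.
    versions: List[BackupVersion] = field(default_factory=list)

    def backup(self, today: date) -> None:
        """Store a new version; flag the oldest unflagged one if over the limit."""
        self.versions.append(BackupVersion(created=today))
        kept = [v for v in self.versions if v.expires is None]
        if len(kept) > MAX_VERSIONS:
            kept[0].expires = today + timedelta(days=INACTIVE_GRACE_DAYS)

    def resident_deleted(self, today: date) -> None:
        """Resident file is gone: flag every version that is not flagged yet."""
        for v in self.versions:
            if v.expires is None:
                v.expires = today + timedelta(days=DELETED_GRACE_DAYS)

    def purge(self, today: date) -> None:
        """Drop versions whose grace period has run out."""
        self.versions = [
            v for v in self.versions if v.expires is None or v.expires > today
        ]
```

With four backups on four consecutive days, the first version is flagged on the fourth day and disappears 30 days later, leaving three versions.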
2.2 Tape pre-migration
Before migration is the pre-migration step.
Before a file can be space-managed, a copy must be made into tape. This is a different copy from the backup copy and does not support versioning, meaning only the active copy can be migrated. This copy is a bit different from full backup copies, keeping less metadata in general.
This is where a bit of the complication also arises - in UTHPC there is a general rule that 2 copies of every file must exist. The most common way to achieve this is by tape backup.
So far, the migration procedure is as follows:
- file is created resident on the disk;
- a backup scheduler or UTHPC admin schedules a backup operation on the file;
- if migration is requested (by the client or by UTHPC administrator for space management purposes), pre-migration starts;
- the tape backup copy is validated and, if it exists, a second, logically and physically separate copy of the file is created in a tape space management pool.
Now 3 copies of the file exist:
- resident file,
- tape backup and
- space-managed copy.
For accounting purposes, this file is considered resident and migrated, but is billed resident only.
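The pre-migration precondition (a valid backup must exist before the space-managed copy is made) can be sketched as a tiny state model. All names here are hypothetical and chosen for illustration; this is not how the tape system is actually driven.

```python
from dataclasses import dataclass


@dataclass
class FileState:
    # Hypothetical flags describing where copies of one file exist.
    resident: bool = True
    has_backup: bool = False
    has_space_managed_copy: bool = False


def premigrate(state: FileState) -> FileState:
    """Create the space-managed copy, but only if a tape backup already exists."""
    if not state.resident:
        raise ValueError("only a fully resident file can be pre-migrated")
    if not state.has_backup:
        raise RuntimeError("no valid tape backup yet; a backup must run first")
    state.has_space_managed_copy = True
    return state


def copy_count(state: FileState) -> int:
    """Number of copies that currently exist (resident, backup, space-managed)."""
    return sum([state.resident, state.has_backup, state.has_space_managed_copy])
```

For a resident file with a valid backup, `copy_count(premigrate(state))` gives 3, matching the list above.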
2.3 Tape migration
With the above in mind, the migration step is quite simple - remove most of the resident file from the disk. This is also the transparent black-magic layer that makes tape migration invisible to most users.
In the migration step, the file is validated once again - the resident, backup and space managed copies must match 100%. If this criterion is met, the resident file's contents are removed except for the first 96 MB. This shell file is called a stub-file and has some special properties.
- Accessing the contents of this file will automatically trigger a recall event from tape. This is done on the filesystem level, invisible to the user. This is also done on a per-file basis, meaning no ordering is done to optimize the recall process.
- Since the stub-file no longer has the full contents of the original file, extra bits are set to "exclude" this file from backups. This may seem odd, but since the migrated file cannot be modified while on tape, the backup scheduler can say with certainty that it has not changed. And since policy dictates that a valid backup must exist, it has already been backed up.
- Deleting this file will now trigger 2 different expiration events, one for the backup copy and one for the space managed copy. The expiration for space management works almost identically to the expiration of a backup copy with the exception that the space managed copy will start expiry immediately.
Warning
If you delete a migrated resident file, after a set period of time, all tape copies will expire as well and you lose your file.
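The migration step can be sketched on in-memory data as follows. This is an illustration only: comparing SHA-256 digests is an assumption about how "match 100%" could be checked, the function names are hypothetical, and 96 MB is taken here as 96 * 1000^2 bytes, which may differ from the real stub size in bytes.

```python
import hashlib

STUB_BYTES = 96 * 1000**2  # assumption: "96 MB" read as decimal megabytes


def digest(data: bytes) -> str:
    """SHA-256 hex digest used to compare copies (an illustrative choice)."""
    return hashlib.sha256(data).hexdigest()


def migrate(resident: bytes, backup_copy: bytes, managed_copy: bytes,
            stub_bytes: int = STUB_BYTES) -> bytes:
    """Return what stays on disk after migration: the stub, or the whole file."""
    if not (digest(resident) == digest(backup_copy) == digest(managed_copy)):
        raise ValueError("copies differ; refusing to migrate")
    if len(resident) <= stub_bytes:
        # Files no larger than the stub are left fully resident.
        return resident
    return resident[:stub_bytes]
```

For example, `migrate(data, data, data, stub_bytes=96)` on 1000 bytes of data keeps only the first 96 bytes, while any mismatch between the three copies raises an error before anything is removed.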
3. Why migrate?
File migration is usually done for the following reasons:
- On the admin side, the oldest and least accessed files are migrated to tape when a filesystem is close to capacity. This is done without prejudice, subject to the following 2 conditions: the number of days since last access must be greater than 365, and the file size must be greater than 1 GB. Also, user home folders are almost never migrated.
- As a user, this allows you to hide cold data on tape while still maintaining access to the files. The recall procedure is automatic and comparatively quick: a single file takes about a minute to start recalling, and the recall then proceeds at roughly 15 TB/day.
- Billing. Tape is cheaper in UTHPC (30 € per TB per year for tape vs 80 € per TB per year for resident). This is not the definitive truth: resident and tape copies are billed differently, and many different agreements are in place. But in essence, tape is still cheaper.
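As a rough illustration, the admin-side conditions and the price difference can be written down as below. The function names are hypothetical, 1 GB is assumed to mean 10^9 bytes, the prices are the list prices quoted above (actual agreements may differ), and the home-folder exception is not modeled.

```python
import os
import time
from typing import Optional

MIN_IDLE_DAYS = 365             # days since last access must exceed this
MIN_SIZE_BYTES = 10**9          # assumption: "1 GB" read as decimal gigabyte
RESIDENT_EUR_PER_TB_YEAR = 80   # list prices from the text
TAPE_EUR_PER_TB_YEAR = 30


def is_candidate(size_bytes: int, days_since_access: float) -> bool:
    """True if a file meets both admin-side migration conditions."""
    return days_since_access > MIN_IDLE_DAYS and size_bytes > MIN_SIZE_BYTES


def is_candidate_path(path: str, now: Optional[float] = None) -> bool:
    """Apply the same rule to a real file, using its size and access time."""
    st = os.stat(path)
    now = time.time() if now is None else now
    idle_days = (now - st.st_atime) / 86400
    return is_candidate(st.st_size, idle_days)


def yearly_saving_eur(terabytes: float) -> float:
    """List-price difference per year between resident and tape storage."""
    return terabytes * (RESIDENT_EUR_PER_TB_YEAR - TAPE_EUR_PER_TB_YEAR)
```

Note that access times are only as reliable as the filesystem's atime handling, so the path-based check is an approximation. At list prices, moving 10 TB to tape would come to `yearly_saving_eur(10)`, that is 500 € per year.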
4. Working with migrated files
How do I migrate files?
As a user, you can't. This high-level data management function is only available to UTHPC administrators, so to migrate files to tape for space saving purposes you must contact UTHPC.
How do I recall files?
You can recall single files simply by reading the beginning of the file, i.e. the stub-file. This will automatically start pulling the file back.
If you have a larger folder and would like to recall it faster, then please contact UTHPC for a tape-optimized bulk recall.
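To get a feeling for how long a single recall might take, the figures quoted in this document (about a minute to start, then roughly 15 TB/day) can be turned into a back-of-the-envelope estimate. This is a sketch only: the function name is hypothetical and real recall times depend on tape load and on how the files are laid out on tape.

```python
RECALL_START_SECONDS = 60  # "about a minute to start recall"
RECALL_TB_PER_DAY = 15     # approximate sustained recall rate
SECONDS_PER_DAY = 86400


def estimated_recall_seconds(size_tb: float) -> float:
    """Rough time to recall one file of the given size in TB."""
    transfer = size_tb * SECONDS_PER_DAY / RECALL_TB_PER_DAY
    return RECALL_START_SECONDS + transfer
```

By this estimate a single 1 TB file would take a little over an and a half hours (`estimated_recall_seconds(1) / 3600`). Because automatic recalls are done per file without ordering, many small files can take considerably longer than the raw rate suggests.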
How do I know that my files are migrated?
This is the main shortcoming of space management - it is quite transparent to the user. A space managed file looks identical to a regular resident file. Even when running a batch job on it, you might not notice the extra time it takes to recall it. There are a few tricks though.
The simplest way is to ask UTHPC: we can do bulk counting using a custom script that is able to sieve migrated files from resident files.
But as a user, you can find them by carefully comparing the output of ls and du. ls shows the file size as defined by the file; this is the bigger number and is considered the "size of the file". du, however, is short for disk usage and counts the data actually stored on disk. This means that du reports the size of a migrated file as 96 MB, as that is the data that is kept on disk. Also, running du will not recall the file.
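The same comparison can be scripted. The sketch below is a heuristic, not UTHPC's own script: it compares the apparent size (what ls shows) with the allocated size (what du counts) using only file metadata, so like du it should not read file contents. The function names are hypothetical, 96 MB is assumed to be 96 * 1000^2 bytes, and sparse or compressed files can produce false positives.

```python
import os
import stat
from typing import Iterator, Tuple

STUB_BYTES = 96 * 1000**2  # assumption: "96 MB" read as decimal megabytes


def looks_migrated(apparent_bytes: int, allocated_bytes: int,
                   stub_bytes: int = STUB_BYTES) -> bool:
    """Heuristic: bigger than a stub, yet less data on disk than its size."""
    return apparent_bytes > stub_bytes and allocated_bytes < apparent_bytes


def scan(root: str) -> Iterator[Tuple[str, int, int]]:
    """Yield (path, apparent, allocated) for files that look migrated."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            st = os.lstat(path)  # metadata only, file contents are not read
            if not stat.S_ISREG(st.st_mode):
                continue
            allocated = st.st_blocks * 512  # st_blocks counts 512-byte blocks
            if looks_migrated(st.st_size, allocated):
                yield path, st.st_size, allocated
```

Calling `list(scan("/path/to/project"))` (the path is a placeholder) returns the suspected migrated files. Treat the result as a hint and ask UTHPC for an authoritative count.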
What should I migrate, if at all?
It's a tricky question because it all depends on the nature of the data. The 96 MB stub-file size also means that files smaller than that will not be migrated.
In short, it is up to you as the data owner to consider how often the data needs to be accessed. The process of migration is free and quite hands-off so workload overhead for UTHPC is minimal.
5. Archival
File archival is quite different from migration or space management. File archiving is usually required because:
- a researcher never wants to discard data;
- "We might need it in the future";
- a grant, agreement or contract requires that the data be stored for an agreed time period.
Long-term archival (in the order of 5+ years) is still a work in progress in UTHPC. The technical base has been set; the reason it has not been implemented yet is not a technical problem but a matter of in-house policy.
For example, let us consider the questions an archivist might ask if requested to archive "something" for 10 years:
- Who is responsible for this in data management, billing, etc.?
- Whose data is it?
- What are you archiving?
- Is this "maybe I will need it in 10 years", or must it be kept for a minimum of 10 years because of policy X?
- In 10 years, can you accurately describe the data you are requesting to the manager of the archive?
The last one is the trickiest since it requires bookkeeping from both the client and UTHPC side. There are currently no procedures in place, and most see archival as a quick way of cutting costs, whether self-mandated or mandated from higher up. Unfortunately, data integrity usually takes precedence, and until these procedures are in place, file migration is a nice middle ground that keeps the data lukewarm. For long-term archival there are no shortcuts, and the action of archiving is one that should not be taken for granted. Keeping "hot" data is a lot easier since constant actions are taken to maintain a single filesystem. In contrast, archived data is put away for many years and might come with many surprises after a few years.
If you are interested in long-term archival, please contact UTHPC and we can start figuring out all of the necessary details.
For any extra comments/notes/suggestions, please contact UTHPC.