Digital preservation

Intelligent storage

Storage is often the aspect of digital preservation that receives the most attention. While it is foundational to a digital preservation programme, it is only one component of it.

When it comes to storage, you should ideally follow these main principles, though no single solution suits every organisation:

  • 2 online copies and 2 tape copies (nearline and/or offline)
  • A minimum of 2 geographically distributed locations
  • If you are going to use cloud storage, do your research carefully (what happens if the third-party supplier goes out of business?)
  • Use different technologies (diversify your hardware and software)
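
The principles above can be written down as a simple check against a documented storage plan. The following is a minimal sketch, not a prescribed tool: the `StorageCopy` fields, the labels such as "disk" and "tape", and the function name are all illustrative assumptions. It only counts distinct location names and cannot verify that the locations are genuinely geographically distributed.

```python
from dataclasses import dataclass


@dataclass
class StorageCopy:
    """One copy of the collection (hypothetical structure for illustration)."""
    medium: str      # e.g. "disk" or "tape"
    online: bool     # True for online; False for nearline or offline
    location: str    # name of the site holding this copy
    technology: str  # vendor or product family, to show diversity


def check_storage_plan(copies):
    """Return the principles the plan does not meet (empty list if all are met)."""
    problems = []
    online = [c for c in copies if c.online]
    tape = [c for c in copies if c.medium == "tape" and not c.online]
    if len(online) < 2:
        problems.append("fewer than 2 online copies")
    if len(tape) < 2:
        problems.append("fewer than 2 nearline/offline tape copies")
    # Distinct names only; a person still has to judge geographic spread.
    if len({c.location for c in copies}) < 2:
        problems.append("fewer than 2 locations")
    if len({c.technology for c in copies}) < 2:
        problems.append("only one technology in use")
    return problems
```

An empty result means the plan meets the four principles as modelled here. It says nothing about whether the cloud supplier research has been done, which remains a human task.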

Tiered storage for different types of digital data is popular once costs and usage are taken into account. Many large research data sets are stored only on tape, rather than on both tape and disk, because of their size and the cost of keeping enough disk space. It is important to make those decisions deliberately and to document them well. These are just the basics of preservation storage; they do not include the preservation systems and associated software that should sit on top of it in order to ingest, manage, and audit the digital objects.

We often refer to storage as intelligent storage because it needs to be flexible, it needs to be scalable and it needs humans to manage it. The same is true of our preservation systems. For example, if we keep copies on tape, then we should occasionally check those tapes, and we need a person to be responsible for this activity and to document the results. Do the checksums still match (integrity checking)? Can the data on the tape be read? Can we restore the data to spinning disk without any issue? It is important to test storage systems and monitor the data you are storing on them; otherwise you are not protecting your digital objects well. Essentially, you are exposing them to risk.
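
The integrity check mentioned above (do the checksums still match?) can be sketched in a few lines. This is an illustrative example only, assuming SHA-256 checksums and a manifest that maps relative file paths to checksums recorded at ingest; the function names are hypothetical and are not taken from any particular preservation system.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its checksum."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def check_fixity(root, manifest):
    """Compare files under root against a stored manifest.

    Returns a dict listing 'missing' and 'changed' relative paths.
    """
    root = Path(root)
    report = {"missing": [], "changed": []}
    for rel_path, expected in manifest.items():
        target = root / rel_path
        if not target.is_file():
            report["missing"].append(rel_path)
        elif sha256_of(target) != expected:
            report["changed"].append(rel_path)
    return report
```

The manifest would be built once at ingest and stored separately from the data. A person still needs to run the check on a schedule, read the report, and document the results.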

It is also important to check how often hardware and software are replaced, because they have limited life spans and will eventually fail; regular replacement protects against that failure. In these scenarios, the human element is as important as the technology that underpins storage systems: humans will be the ones to test the system, check that everything is working, and make sure refreshment happens regularly.

Backup is NOT preservation!

It is important to understand that backup alone is not digital preservation. It is only a part of it.

More importantly, how good are your backups if you never practice recovery? If you have not tried restoring any of your backups, then assume they will likely fail. It is important to occasionally restore from tape and to check digital materials at random. Relying solely on third party backup systems to do the work for you is NOT digital preservation.

Remember: 

  • A backup is only a backup if you can recover it!
  • A backup is only part of digital preservation!
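
A restore test of the kind described above, restoring from backup and checking materials at random, might look like the following sketch. It assumes the backup has already been restored to a directory and that a manifest of SHA-256 checksums recorded at ingest is available; `spot_check_restore` and its parameters are hypothetical names chosen for illustration.

```python
import hashlib
import random
from pathlib import Path


def file_digest(path):
    """SHA-256 hex digest of a file's contents."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def spot_check_restore(restore_dir, manifest, sample_size, seed=None):
    """Verify a random sample of restored files against known checksums.

    manifest maps relative paths to the SHA-256 digests recorded at ingest.
    Returns (sampled_paths, failures); failures lists sampled paths that are
    missing from the restore or whose checksum no longer matches.
    """
    rng = random.Random(seed)
    paths = sorted(manifest)
    sampled = rng.sample(paths, min(sample_size, len(paths)))
    failures = []
    for rel_path in sampled:
        restored = Path(restore_dir) / rel_path
        if not restored.is_file() or file_digest(restored) != manifest[rel_path]:
            failures.append(rel_path)
    return sampled, failures
```

A matching checksum shows that the bits came back intact. It does not show that the file can still be opened and understood, so opening a few restored files by hand remains worthwhile.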

Information security

Many types of digital material selected for long-term preservation may contain confidential and sensitive information that must be protected so that it is not accessed by unauthorized users. In many cases there may be legal or regulatory obligations on the organization. These materials must be managed in accordance with the organization's Information Security Policy to protect against security breaches. ISO 27001 describes the manner in which security procedures can be codified and monitored (ISO, 2013a). ISO 27002 provides guidelines on the implementation of ISO 27001-compliant security procedures (ISO, 2013b). Conforming organizations can be externally accredited and validated. In some cases your own organization's Information Security Policy may also impact on digital preservation activities and you may need to enlist the support of your Information Governance and ICT teams to facilitate your processes.

Information security methods such as encryption add to the complexity of the preservation process and should be avoided if possible for archival copies. Other security approaches may therefore need to be more rigorously applied for sensitive unencrypted files; these might include restricting access to locked-down terminals in controlled locations (secure rooms), or strong user authentication requirements for remote access. However, these alternative approaches may not always be sufficient or feasible. Encryption may also be present on files that are received on ingest from a depositor, so it is important to be aware of information security options such as encryption, the management of encryption keys, and their implications for digital preservation.

Techniques for protecting information

Several information security techniques may be applied to protect digital material, though this list is not exhaustive:

  • Encryption is a cryptographic technique which protects digital material by converting it into a scrambled form. Encryption may be applied at many levels, from a single file to an entire disk. Many encryption algorithms exist, each of which scrambles information in a different way. These require the use of a key to unscramble the data and convert it back to its original form. The strength of the encryption method is influenced by the key size. For example, with the same algorithm, a 256-bit key will be harder to break by brute force than a 128-bit key.

    It should be noted that encryption is only effective when a third party does not have access to the encryption key in use. A user who has entered the password for an encrypted drive and left their machine powered on and unattended will provide third parties with an opportunity to access data held in the encrypted area, which may result in its release.

    Similarly, encryption security measures (if used) can lose their effectiveness over time in a repository: there is effectively an arms race between encryption techniques and computational methods to break them. Hence, if used, all encryption by a repository must be actively managed and updated over time to remain secure.

    Encrypted digital material can only be accessed over time in a repository if the organization manages its keys. The loss or destruction of these keys will result in data becoming inaccessible.

  • Access controls allow an administrator to specify who is allowed to access digital material and the type of access that is permitted (for example, read-only or write).
  • Redaction refers to the process of analyzing a digital resource, identifying confidential or sensitive information, and removing or replacing it. Common techniques applied include anonymization and pseudonymization to remove personally identifiable information, as well as cleaning of authorship information. When related to datasets this is usually carried out by the removal of information while retaining the structure of the record in the version being released. You should always carry out redaction on a copy of the original, never on the original itself.

    Many digital materials created using office systems, such as Microsoft Office, are stored in complex formats, which in older versions are proprietary and binary-encoded. These formats may contain significant information which is not displayed, and its presence may therefore not be apparent. They may incorporate change histories, audit trails, or embedded metadata, by means of which deleted information can be recovered or simple redaction processes otherwise circumvented. Digital materials may be redacted through a combination of information deletion and conversion to a different format. Certain formats, such as plain ASCII text files, contain displayable information only. Conversion to this format will therefore eliminate any information that may be hidden in non-displayable portions of a bit stream.
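
For datasets, the redaction approach described above (working on a copy, pseudonymizing or removing values, and keeping the structure of each record) can be sketched as follows. This is an illustrative example under stated assumptions: records are dictionaries of strings, pseudonyms are derived with HMAC-SHA-256 from a secret key held by the organization, and the function and parameter names are hypothetical.

```python
import copy
import hashlib
import hmac


def pseudonymize(value, secret_key):
    """Replace a value with a keyed, repeatable pseudonym (HMAC-SHA-256)."""
    mac = hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256)
    return "p_" + mac.hexdigest()[:16]


def redact_records(records, secret_key, pseudonymize_fields=(), remove_fields=()):
    """Return redacted copies of records; the originals are left untouched.

    Fields in pseudonymize_fields get a stable pseudonym, so rows about the
    same person can still be linked. Fields in remove_fields are blanked but
    kept, so the structure of each record is preserved.
    """
    redacted = copy.deepcopy(records)
    for row in redacted:
        for field in pseudonymize_fields:
            if row.get(field):
                row[field] = pseudonymize(row[field], secret_key)
        for field in remove_fields:
            if field in row:
                row[field] = ""
    return redacted
```

Because the pseudonym depends on the secret key, that key needs the same careful management as an encryption key. Pseudonymized data may also still be re-identifiable from the fields that remain, so a person should review the output before release.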
