LibGuides: Digital preservation: File formats

Identifying file formats

File format identification is an important part of digital preservation. Knowing what type of file format you have, and what version it is, will assist with preservation planning for that digital object. It will also provide information on the types of software programs that can open and render the digital object. It is important to note that a program may be able to open a particular file format without rendering it correctly. This means that the look and feel could be altered, sometimes slightly and sometimes enough to make the object difficult to interpret. This is particularly true for older file formats that were created with legacy software programs. Be aware that legacy can mean only 10 years!

Knowing the file format and version of a digital object also means you can plan for its future. Does it need to be normalised on ingest? Does it need to be migrated to a new file format? Would emulation be a better fit? This is all part of preservation planning.

File format identification tools and methods constantly improve and develop. File format identification should not be seen as a one-off activity which is only run when a digital object is first given to a repository; it is good practice to regularly run identification software over collections to benefit from improvements in the tools.

DROID

PRONOM

Other identification tools

DROID (Digital Record Object Identification) is a tool for automated batch identification of file formats. DROID uses the PRONOM registry to identify file formats based on file format signature, file extension and other technical information contained in PRONOM. It can export reports to .CSV files for querying and creating statistics from.
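To make the idea of a file format signature concrete, here is a minimal sketch of signature-based identification. It is not DROID's algorithm and does not use PRONOM data: the signature table is a small hand-picked sample of well-known leading byte sequences, and the function names `identify_bytes` and `identify_file` are hypothetical.

```python
from pathlib import Path
from typing import Optional

# Illustrative signature table: (leading magic bytes, format label).
# This is a hand-picked sample, not data taken from PRONOM.
SIGNATURES = [
    (b"%PDF-", "PDF"),
    (b"\x89PNG\r\n\x1a\n", "PNG"),
    (b"GIF87a", "GIF"),
    (b"GIF89a", "GIF"),
    (b"\xff\xd8\xff", "JPEG"),
    (b"II*\x00", "TIFF (little-endian)"),
    (b"MM\x00*", "TIFF (big-endian)"),
]


def identify_bytes(header: bytes) -> Optional[str]:
    """Return the label of the first signature that matches the header."""
    for magic, label in SIGNATURES:
        if header.startswith(magic):
            return label
    return None


def identify_file(path: str) -> dict:
    """Identify a file by signature and record its extension for comparison."""
    p = Path(path)
    with p.open("rb") as handle:
        header = handle.read(16)
    return {
        "name": p.name,
        "extension": p.suffix.lower(),
        "format": identify_bytes(header),
    }
```

Comparing the `format` and `extension` values in the result is one simple way to spot files whose extension does not match their content. A real tool works from a far larger registry and can also match signatures at other positions in the file.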

DROID is a free and open source digital preservation tool. The newest version can be downloaded from The National Archives website.
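The CSV reports that DROID exports can be summarised with a few lines of scripting. The sketch below counts how many files were assigned each identifier. It assumes the report has a header row containing a column named `PUID`; check the column names in your own export and pass a different `column` value if they differ. The function names are hypothetical.

```python
import csv
from collections import Counter
from typing import Iterable


def count_formats(rows: Iterable[dict], column: str = "PUID") -> Counter:
    """Count rows per value of the chosen column; blanks become 'unidentified'."""
    counts = Counter()
    for row in rows:
        value = (row.get(column) or "").strip()
        counts[value or "unidentified"] += 1
    return counts


def count_formats_in_report(csv_path: str, column: str = "PUID") -> Counter:
    """Read a CSV report with a header row and count values in one column."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        return count_formats(csv.DictReader(handle), column)
```

Calling `count_formats_in_report("report.csv").most_common(10)` would list the ten most frequent values, where `report.csv` is a placeholder for the path to your own export.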

DROID - latest version
File format identification tool
DROID - previous versions
File format identification tool

PRONOM is a technical registry of file formats that has been created and maintained by The National Archives. It contains information about file formats and supporting software products or technical components. It is a resource to support ingest and long-term digital preservation.

It is regularly maintained and updated by The National Archives. While it is not a comprehensive list of file formats, submissions are encouraged. Researchers working with rare and proprietary file formats, as well as research data managers and archivists, have made submissions to PRONOM. Information on how to submit is available on the PRONOM website.

PRONOM
Technical registry of file formats

Siegfried is a file format identification tool that uses the PRONOM registry. It can be used in a web browser or downloaded and installed.

FIDO is available from the Open Preservation Foundation and also uses the PRONOM registry.

File formats for preservation

Selecting target formats for preservation

Not all digital formats are suited or indeed designed for archiving or preservation. Any preservation policy should therefore recognize the requirements of the collection content and decide upon a file format which best preserves those qualities. This means identifying what is important in the content and pairing the content with a suitable choice of preservation format or access format.

Here are some factors to consider in selecting your preferred file format:

Open Source versus proprietary

Open source formats, such as JPEG2000, are very popular due to their non-proprietary nature and the sense of ownership that stakeholders can attain with their use. However, the choice of open source versus proprietary formats is not that simple and needs to be looked at closely. Proprietary formats, such as TIFF, are seen as being very robust; however, these formats will ultimately be susceptible to upgrade issues and obsolescence if the owner goes out of business or develops a new alternative. Similarly, open source formats can be seen as technologically neutral, being non-reliant on business models for their development; however, they can also be seen as vulnerable to the susceptibilities of the communities that support them.

Although such non-proprietary formats can be selected for many resource types, this is not universally the case. For many new areas and applications, e.g. Geographical Information Systems or Virtual Reality, only proprietary formats are available. In such cases a crucial factor will be the export formats supported to allow data to be moved out of (or into) these proprietary environments.

Documentation and standards

The availability of documentation - for example, published specifications - is an important factor in selecting a file format. Documentation may exist in the form of vendor’s specifications, an international standard, or may be created and maintained within the context of a user community. Look for a standard which is well-documented and widely implemented. Make sure the standard is listed in the PRONOM file format registry.

Adoption

A file format which is relied upon by a large user group creates many more options for its users. It is worth bearing in mind levels of use and support for formats in the wider world, but also finding out what organizations similar to yours are doing and sharing best practice in the selection of formats. Wide adoption of a format can give you more confidence in its preservation.

Lossless versus lossy

Lossy formats are those where some data is thrown away as part of the encoding in order to compress the file. The MP3 format is widely used for commercial distribution of music files over the web, because the lossy encoding process results in smaller file sizes.

TIFF is one example of an image format that is capable of storing image data losslessly, so it can hold a high-resolution image without discarding any of it. JPEG is an example of a lossy image file format. Its versatility and small file size make it a suitable choice for creating a smaller access copy of an image for transmission over a network. It would not be appropriate to store the JPEG image as both the access and archival format because of the irretrievable data loss this would involve.

One rule of thumb could be to choose lossless formats for the creation and storage of "archival masters"; lossy formats should only be used for delivery / access purposes, and not considered to be archival. A rule like this is particularly suitable for a digitization project, especially one involving still images.
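The difference between the two kinds of encoding can be shown with a small sketch. The lossless function uses zlib (DEFLATE) from the Python standard library. The lossy function is only a stand-in that drops the low-order bits of each byte; it is not how JPEG or MP3 encoding works, and both function names are hypothetical.

```python
import zlib


def lossless_round_trip(data: bytes) -> bytes:
    """Compress and decompress with zlib (DEFLATE); nothing is discarded."""
    return zlib.decompress(zlib.compress(data))


def lossy_round_trip(data: bytes, keep_bits: int = 4) -> bytes:
    """Keep only the top `keep_bits` bits of each byte, discarding the rest."""
    mask = (0xFF << (8 - keep_bits)) & 0xFF
    return bytes(value & mask for value in data)
```

For any input, `lossless_round_trip` returns bytes identical to the original. After `lossy_round_trip`, the discarded bits cannot be recovered from the output, which is why a lossy copy is unsuitable as an archival master.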

Support for metadata

Some file formats have support for metadata. This means that some metadata can be inscribed directly into an instance of a file (for example, JPEG2000 supports some rights metadata fields). This can be a consideration, depending on your approach to metadata management.

Significant properties of file formats

This is a complex area. One view regards significant properties as the "essence" of file content; a strategy that gets to the heart of "what to preserve". What does the user community expect from the rendition? What aspects of the original are you trying to preserve? This strategy could mean you don’t have to commit to preserving all aspects of a file format, only those that have the most meaning and value to the user.

Significant properties may also refer to a very specific range of technical metadata that is required to be present in order for a file to be rendered (e.g. image width). Some migration tools may strip out this metadata, or it may become lost through other curation actions in the repository. The preservation strategy needs to prevent this loss happening. It thus becomes important to identify, extract, store and preserve significant properties at an early stage of the preservation process.

- from the Digital Preservation Coalition Handbook, 2nd Edition

Validation

File format validation performs a number of functions that help to confirm a file is well-formed and valid. Validation will:

confirm that a file conforms to the specific file format specification, which is a set of documentation that lists the standards a specific file format must follow (this includes specific file signature information and embedded metadata).
notify you if the file does not conform to the specification.
help to ensure that files can be read by future readers. A valid file is much easier to manage over time than one that does not conform. A file that is not valid may create issues over time, especially when trying to change the file format type (known as migration). Access issues will also be harder to diagnose if the file does not conform and can no longer be opened. Future software may also have issues rendering the file correctly if it does not conform to the specification.
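As a rough illustration of what checking conformance to a specification involves, the sketch below tests only the eight-byte header of a TIFF file: the byte-order mark, the magic number 42, and the offset of the first image file directory (IFD). It is a toy example and not how any real validator is implemented. The name `check_tiff_header` is hypothetical, and BigTIFF, which uses a different magic number, is deliberately not handled.

```python
import struct


def check_tiff_header(data: bytes) -> list:
    """Return a list of problems found in the 8-byte TIFF header (empty if none)."""
    if len(data) < 8:
        return ["file is shorter than the 8-byte TIFF header"]

    order = data[:2]
    if order == b"II":
        endian = "<"  # little-endian
    elif order == b"MM":
        endian = ">"  # big-endian
    else:
        return ["byte-order mark is neither 'II' nor 'MM'"]

    problems = []
    # Bytes 2-3 hold the magic number; bytes 4-7 hold the first IFD offset.
    magic, ifd_offset = struct.unpack(endian + "HI", data[2:8])
    if magic != 42:
        problems.append(f"magic number is {magic}, expected 42")
    if ifd_offset < 8 or ifd_offset >= len(data):
        problems.append(f"first IFD offset {ifd_offset} is outside the file")
    return problems
```

An empty list means only that the header passed these three checks. A full validator goes on to check every directory entry, tag and data offset against the specification.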

For these reasons, file format validation is important. It is an especially useful tool for digitisation workflows as it will ensure that digital objects are being created correctly. When you are in control of creating a digital object, validation is an important step. However, it is important to know that file format validation has the following limitations:

Validation tools improve over time, and it is therefore recommended that validation software is regularly run over digital objects to catch new non-conformances.
Not every file format type has validation software and it is worth knowing what is available.
Validation software will also not find issues with files that do not relate to the specific rules set for the file format specification. For example, a TIFF file may have some visual corruption in it that the rules set in the file format validation software will not look for. It will therefore call the file well-formed and valid, though upon inspection, something is clearly wrong with the file.

This is why fixity is equally important in digital preservation. Generating what is known as a checksum early on makes it possible to detect later corruption of a file. The section on fixity goes into greater detail on creating and confirming checksums and their uses in digital preservation.
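A minimal sketch of generating and confirming a checksum is shown here, using SHA-256 from the Python standard library. The choice of SHA-256 and the function names are assumptions made for illustration; a repository may use a different algorithm or dedicated fixity tools.

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Return the SHA-256 checksum of a file as a hexadecimal string."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so that large files do not have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fixity_ok(path: str, recorded_checksum: str) -> bool:
    """Compare a file's current checksum with the one recorded earlier."""
    return sha256_of_file(path) == recorded_checksum.strip().lower()
```

The checksum would be recorded when the file is first received, and `fixity_ok` run at intervals afterwards. A `False` result shows that the bytes have changed since the checksum was recorded, although it does not show what changed or why.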

JHOVE

The most common validation tool is JHOVE, maintained by the Open Preservation Foundation. It is an open source validation tool that can validate the following file formats: TIFF, JPEG, PDF, AIFF, ASCII, GIF, JPEG2000, HTML, BYTESTREAM, UTF8, WAVE and XML.

JHOVE stands for JSTOR/Harvard Object Validation Environment. It was a joint project between JSTOR and Harvard University to create a tool to validate files and extract metadata. In 2015, the maintenance of the software was transferred to the Open Preservation Foundation.

Jpylyzer

Jpylyzer is a validation tool for JPEG2000 (JP2) images. It also reports on the image's technical characteristics or technical metadata (known as feature extraction). It is an open source tool maintained by the Open Preservation Foundation. The creation of this validation tool was made possible by partial funding from the EU FP7 project known as SCAPE. Unlike JHOVE, Jpylyzer validates only one file format, but it has a richer set of validation rules for JPEG2000 than JHOVE and is therefore preferred for validating this file type. It is commonly used in digitization workflows where TIFF files are migrated to JPEG2000 for storage and access reasons.

veraPDF

veraPDF validates all PDF/A parts and conformance levels. PDF/A is a version of PDF intended for long-term preservation and archiving of electronic documents. PDF/A is meant to prohibit features that are not suitable for long-term preservation, including font linking (instead it will embed the font file in the document), encryption and annotations. However, it does not work for every document and creating a valid PDF/A can be labour intensive. Conformance levels include A (Accessible), B (Basic) and U (Unicode). U was created to deal with specialized fonts and characters such as Greek, Arabic and Chinese.

On top of conformance levels, there are also three versions of PDF/A, which means a PDF/A document has a version number and conformance level associated with it. veraPDF will help to validate the various versions and conformance levels of PDF/A, but will not be able to validate any other version of PDF; JHOVE will be required for that. It is good practice to validate a PDF/A file using both veraPDF and JHOVE, as they validate different aspects of the PDF file.
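Version and conformance level are usually written together in a label such as PDF/A-2b. The sketch below splits such a label into its two parts and rejects combinations that do not exist. It covers only the three versions and the levels A, B and U described above, and relies on the fact that level U is not defined for PDF/A-1. The function name and the label format it accepts are assumptions for illustration, and the sketch says nothing about whether a file actually conforms.

```python
import re

# Conformance levels defined for each PDF/A version (versions 1 to 3 only).
LEVELS_BY_PART = {1: {"a", "b"}, 2: {"a", "b", "u"}, 3: {"a", "b", "u"}}


def parse_pdfa_label(label: str):
    """Split a label such as 'PDF/A-2b' into (version, level), or return None."""
    match = re.fullmatch(r"PDF/A-(\d)([A-Za-z])", label.strip())
    if match is None:
        return None
    part, level = int(match.group(1)), match.group(2).lower()
    if level not in LEVELS_BY_PART.get(part, set()):
        return None
    return part, level
```

A label that parses successfully only states what a document claims to be; confirming the claim is the job of a validator.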
MediaConch

MediaConch is an implementation checker, policy checker, reporter, and fixer that targets preservation-level audiovisual files (specifically Matroska, Linear Pulse Code Modulation (LPCM) and FF Video Codec 1 (FFV1)) for use in memory institutions, providing detailed and batch-level conformance checking. It has an interface accessible by the command line, a graphical user interface, or a web interface. While it validates several audiovisual file types, it does not validate every file format type. The policy checker part of the tool is useful, but it is complex and requires a certain level of knowledge about the different file formats.

EpubCheck

EpubCheck validates EPUB files and will extract technical and other embedded metadata. It checks things such as:

OCF container structure
OPF and OPS mark-up
internal reference consistency

It was largely developed by Adobe Systems and is currently supported by the International Digital Publishing Forum (IDPF). An online version of EpubCheck is available at: http://validator.idpf.org/

Other validators

There are several other file format validation tools available. These include, but are not limited to:

Warctools - for validation of WARC files created for web archiving purposes
BadPeggy - for validation of JPEG, GIF, BMP and PNG files. It provides a technical validation and also a visual validation that will detect any visual corruption in a file
BWF MetaEdit - used for extraction, validation, editing and embedding of metadata in Broadcast WAVE Format (BWF) files. It can also embed MD5 checksums in the file

The COPTR registry of digital preservation tools has a list of further file format validation tools.