
Research Data Management

RDM refers to the active curation of data throughout the research life-cycle.

Data planning

There are many decisions to make about data management even before data creation/collection - choosing hardware and software, intellectual property rights, ethics, regulatory issues, and more. These decisions will affect how data is accessed, used and preserved in the future. The best starting point is a data management plan, whether it is a funder requirement or not. Even informally noting your research plans and project guidelines can make your life easier.

Data management planning improves efficiency in research, ensures information is protected, allows results to be checked by others, improves reproducibility, improves exposure via sharing, and allows compliance with funders' policies. A Data Management Plan (DMP) is a key element of good data management. The DMP will describe the data management life cycle for the data to be collected, generated and processed.

Source: University of California, Santa Cruz: The Research Data Management Lifecycle

A DMP is intended to be a living document in which information can be made available in more detail through updates and periodic reviews as the research project progresses, and when significant changes occur. DMPs should therefore have clear version numbers.

The University of Cambridge summarises data management planning:

  • Planning ahead will ensure that you have adequate technological resources (like storage space and support staff time); your data will be robust and free from versioning errors and gaps in documentation; your data is backed up and safe from sudden loss or corruption; you can meet legal and ethical requirements; you are able to share your finalised data if you or your funder desires; and your data will remain accessible and comprehensible in the near, middle, and distant future.
  • Funders will expect you to prepare a DMP, and may even expect regular reviews of your DMP while you manage your grant. (Read more about UFS funder requirements.)
  • Since research data management is simply good research practice, you will already be well on your way - DMPs are a way to show that you have thought about how to create, store, back up, share, and preserve your data.
  • Find out what your funder requires your DMP to cover by checking the funder's website, or by using one of the tools in this guide to compile your DMP.
  • Ask for support - contact your research support staff early in your project so that there is enough time to plan. At the UFS you can start at the Digital Scholarship Centre for advice or consultation.

Tools and templates to create a DMP:

UFS templates

DMP tools available online:

Ownership and rights

Intellectual Property Rights (IPR), like copyright and patents, affect how you and others can use your research outputs, including your research data, in terms of their dissemination, future related research projects, and associated profit and credit.

Who can help you with IPR questions?

At the UFS, the Innovation office provides more information on IPR, and the Research website provides more information on legislation.

What about data I find online?

Check to see how the data is licensed - IPR still applies even if you don't see a © attached or 'all rights reserved' notice. Creative Commons licensing will give you an idea of different licensing options and what to look for. And always cite the data. DataCite provides guidelines on how to do this.

Ethics and consent

Ethical guidelines are provided by funders and also by the University. In addition, laws governing personal data must be adhered to.

Retention

It costs money to keep all your data and files for future use, and it can be confusing to find specific items later. Selecting what data to retain, and for how long, will save storage space and staff hours - in other words, money and time. Selecting what to keep or dispose of will involve subjective judgement, since it is not possible to know exactly what might be needed in future. The best way to select what to keep or delete is to abide by relevant funder/institutional policies and to document all decisions (including the reasons).

The Digital Curation Centre lists questions that can help you decide what to keep and what to delete:

  • does my funder or the University need me to keep the data and/or make it available for a certain amount of time?
  • does this data constitute the 'vital records' of the project and therefore need to be retained indefinitely?
  • do I have the legal and intellectual property rights to keep and re-use the data? If not, can it be negotiated?
  • does sufficient documentation and descriptive information (metadata) exist to explain the data, and allow the data to be found wherever it is stored?
  • if I need to pay to keep the data, can I afford it?

Make sure you are aware of your responsibilities in terms of data protection if you need to store your data.

Durable formats

File formats might be dictated by the software you use, or by the conventions of your discipline. Sometimes you will have to choose from various formats. During the planning phase, keep the following in mind:

  • the software and formats used in past projects in your discipline
  • discipline-specific conventions
  • software that is compatible with the hardware you use
  • formats that will be easy to share and use in the future
  • formats that are at risk of obsolescence
  • formats that can be annotated with metadata for interpretation in the future

Here are some good file formats for the preservation of specific types of data:

Type of data: Tabular data with extensive metadata (variable labels, code labels, and defined missing values)

Recommended formats:
  • SPSS portable format (.por)
  • Delimited text and command (‘setup’) file (SPSS, Stata, SAS, etc.)
  • Structured text or mark-up file of metadata information, e.g. DDI XML file

Acceptable formats:
  • Proprietary formats of statistical packages: SPSS (.sav), Stata (.dta), MS Access (.mdb/.accdb)

Type of data: Tabular data with minimal metadata (column headings, variable names)

Recommended formats:
  • Comma-separated values (.csv)
  • Tab-delimited (.tab)
  • Delimited text with SQL data definition statements

Acceptable formats:
  • Delimited text (.txt) with characters not present in data used as delimiters
  • Widely-used formats: MS Excel (.xls/.xlsx), MS Access (.mdb/.accdb), dBase (.dbf), OpenDocument Spreadsheet (.ods)

Type of data: Geospatial data (vector and raster data)

Recommended formats:
  • ESRI Shapefile (.shp, .shx, .dbf, .prj, .sbx, .sbn optional)
  • Geo-referenced TIFF (.tif, .tfw)
  • CAD data (.dwg)
  • Tabular GIS attribute data
  • Geography Markup Language (.gml)

Acceptable formats:
  • ESRI Geodatabase format (.mdb)
  • MapInfo Interchange Format (.mif) for vector data
  • Keyhole Mark-up Language (.kml)
  • Adobe Illustrator (.ai), CAD data (.dxf or .svg)
  • Binary formats of GIS and CAD packages

Type of data: Textual data

Recommended formats:
  • Rich Text Format (.rtf)
  • Plain text, ASCII (.txt)
  • eXtensible Mark-up Language (.xml) text according to an appropriate Document Type Definition (DTD) or schema

Acceptable formats:
  • Hypertext Mark-up Language (.html)
  • Widely-used formats: MS Word (.doc/.docx)
  • Some software-specific formats: NUD*ST, Nvivo and ATLAS.ti

Type of data: Image data

Recommended formats:
  • TIFF 6.0 uncompressed (.tif)

Acceptable formats:
  • JPEG (.jpeg, .jpg, .jp2) if original created in this format
  • GIF (.gif)
  • TIFF other versions (.tif, .tiff)
  • RAW image format (.raw)
  • Photoshop files (.psd)
  • BMP (.bmp)
  • PNG (.png)
  • Adobe Portable Document Format (PDF/A, PDF) (.pdf)

Type of data: Audio data

Recommended formats:
  • Free Lossless Audio Codec (FLAC) (.flac)

Acceptable formats:
  • MPEG-1 Audio Layer 3 (.mp3) if original created in this format
  • Audio Interchange File Format (.aif)
  • Waveform Audio Format (.wav)

Type of data: Video data

Recommended formats:
  • MPEG-4 (.mp4)
  • OGG video (.ogv, .ogg)
  • Motion JPEG 2000 (.mj2)

Acceptable formats:
  • AVCHD video (.avchd)

Type of data: Documentation and scripts

Recommended formats:
  • Rich Text Format (.rtf)
  • PDF/UA, PDF/A or PDF (.pdf)
  • XHTML or HTML (.xhtml, .htm)
  • OpenDocument Text (.odt)

Acceptable formats:
  • Plain text (.txt)
  • Widely-used formats: MS Word (.doc/.docx), MS Excel (.xls/.xlsx)
  • XML marked-up text (.xml) according to appropriate DTD or schema, e.g. XHTML 1.0

A note on non-proprietary or open formats: their specifications are publicly available, so they could be supported for longer than proprietary formats, or, at least, there are no restrictions on their use or code.
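The format recommendations listed above can also be written down as a simple lookup, for instance to flag files in a project folder that are not in a recommended preservation format. The Python sketch below covers only three of the data types; the names `FORMATS` and `preservation_status` are illustrative assumptions and do not refer to any UFS tool.

```python
from pathlib import Path

# A few rows from the list above, keyed by data type (extensions in lower case).
# Image data is left out on purpose: the extension alone cannot tell
# uncompressed TIFF 6.0 apart from other TIFF versions.
FORMATS = {
    "tabular": {
        "recommended": {".csv", ".tab"},
        "acceptable": {".txt", ".xls", ".xlsx", ".mdb", ".accdb", ".dbf", ".ods"},
    },
    "audio": {
        "recommended": {".flac"},
        "acceptable": {".mp3", ".aif", ".wav"},
    },
    "video": {
        "recommended": {".mp4", ".ogv", ".ogg", ".mj2"},
        "acceptable": {".avchd"},
    },
}


def preservation_status(filename: str, data_type: str) -> str:
    """Return 'recommended', 'acceptable' or 'not listed' for a file name."""
    suffix = Path(filename).suffix.lower()
    for status in ("recommended", "acceptable"):
        if suffix in FORMATS[data_type][status]:
            return status
    return "not listed"


print(preservation_status("survey_results.csv", "tabular"))  # recommended
print(preservation_status("interview_01.MP3", "audio"))      # acceptable
print(preservation_status("notes.docx", "tabular"))          # not listed
```

A check like this looks only at the file extension, so it is a quick screening aid rather than proof that a file really is in the stated format.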

Storage and backup

It is important to store research data securely and safely in order to avoid loss, damage, theft or unlawful access. Think about the following when deciding where to store research data:

  • Will the storage solution provide enough storage space?
  • Will my collaborators/team members be able to access my data?
  • Can the data be accessed from different devices?
  • For how long will I need to store my data?
  • How much will it cost?
  • Is the storage solution secure, e.g. who can access it, is it encrypted, might data be lost or hacked?
  • Are regular backups made?
  • Does the storage solution meet data protection, funder and other requirements?
  • What happens if I leave the University?

figshare at the University of the Free State can be used for storing and in-project sharing of research data. It provides POPIA- and GDPR-compliant, secure cloud-based storage space, which is regularly backed up. This service is available to research staff and postgraduate students. figshare can be used for short-term or long-term storage of data, as well as for sharing data where appropriate.

Storing data on portable storage media (CDs, DVDs and memory sticks) and personal computers and laptops is inadvisable. If you do use any of these, make sure that the device or media is encrypted, and that you have multiple copies in more secure storage space.

Here are some tips for safe data storage and backups.

Storing sensitive or confidential data?

  • Limit physical access or encrypt data.
  • The UFS has a secure data hosting service for its researchers.
  • Avoid compromising sensitive and confidential data in future by storing information about the data's sensitivity with the data itself, i.e. your data documentation or metadata description. See more on documentation below.

What to back up and when?

  • Think about which files you will need in order to re-create/restore the data that are crucial to your work.
  • The more important data often change more, and therefore need more regular backups.
  • Since backups can be time-consuming and expensive, focus on backups for key information, programs, algorithms, or documentation that you will need to restore the data in case of data loss.
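A backup is only useful if the copy is intact. As a minimal sketch, the Python example below copies a file to a backup folder and then compares SHA-256 checksums of the original and the copy. Checksum verification is an assumption added here for illustration, not a UFS procedure, and the function names `sha256_of` and `back_up` are hypothetical.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def back_up(source: Path, backup_dir: Path) -> Path:
    """Copy `source` into `backup_dir` and verify the copy by checksum."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    target = backup_dir / source.name
    shutil.copy2(source, target)  # copy2 also tries to preserve timestamps
    if sha256_of(source) != sha256_of(target):
        raise OSError(f"Backup of {source} does not match the original")
    return target


# Demonstration in a temporary folder, so no real data is touched.
with tempfile.TemporaryDirectory() as tmp:
    original = Path(tmp) / "analysis_script.py"
    original.write_text("print('hello')\n", encoding="utf-8")
    copy = back_up(original, Path(tmp) / "backup")
    print(copy.name)  # analysis_script.py
```

In practice the backup folder would be on separate, more secure storage rather than next to the original.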

Security

Personally identifiable information can be direct (information that, on its own, allows you to identify an individual, e.g. names, email addresses including a name, fingerprints, facial photographs, etc.), strongly indirect (information that allows you to identify an individual through minimal effort, e.g. postal addresses, telephone numbers, URLs of personal pages, etc.), or indirect (information that allows you to identify an individual when linked with other information, e.g. age, location, gender, job title, etc.).

A key approach to protecting personally identifiable information is anonymisation, meaning the irreversible removal and deletion of personal identifiers. In quantitative research, sometimes all that is needed is the removal of direct identifiers. More complex datasets (with more free text) might need more anonymisation techniques. In qualitative research anonymisation is more complicated and needs personal judgement. Here are some best practices for guidance.
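For the simplest case mentioned above, removing direct identifiers from a quantitative dataset, a minimal Python sketch could look like this. The column names and the record are made up for illustration. Note that indirect identifiers such as age and job title remain in the output, so this step alone may not be enough to anonymise a dataset.

```python
import csv
import io

# Hypothetical column names treated as direct identifiers in this example.
DIRECT_IDENTIFIERS = {"name", "email"}


def drop_direct_identifiers(rows, identifiers=DIRECT_IDENTIFIERS):
    """Return copies of the rows without the direct-identifier columns."""
    return [
        {key: value for key, value in row.items() if key not in identifiers}
        for row in rows
    ]


# A made-up record, read the same way a real .csv file would be.
raw = io.StringIO(
    "name,email,age,job_title\n"
    "Thandi Example,thandi@example.org,34,Lecturer\n"
)
rows = list(csv.DictReader(raw))
cleaned = drop_direct_identifiers(rows)
print(cleaned)  # [{'age': '34', 'job_title': 'Lecturer'}]
```

The function returns new records and leaves the input untouched. For the result to count as anonymisation in the irreversible sense described above, the original file containing the identifiers would also have to be deleted.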

Organising and documenting data

Choose a logical and consistent way to name and organise your files - this will allow you to easily locate and use them. Think about your naming conventions and file structure at the beginning of your project to ensure consistency and prevent version control problems.

Organising files

  • Group files in folders so that information on a particular topic can be found in one place.
  • Check to see if your discipline/department/team already has existing procedures you need to adhere to.
  • Name folders appropriately to avoid confusion in shared workspaces. This will make it easier for anyone to find relevant information.
  • Be consistent in the naming scheme for your folders. It is better to decide on a naming scheme at the start of the project.
  • Structure folders hierarchically, with a limited number of folders on broad topics, and more specific folders within.
  • Separate ongoing and completed work. Keep folders for work you are actively working on, and move the work to a different folder when you are done.
  • Ensure that you regularly back up your files.
  • Review your files regularly, and at the end of a project, to make sure files are not kept needlessly.
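A hierarchical folder structure like the one described above can be created once at the start of a project so that every team member begins with the same layout. The Python sketch below uses hypothetical folder names; adapt them to the conventions of your discipline or team.

```python
import tempfile
from pathlib import Path

# Hypothetical layout: a few broad top-level folders with more specific ones inside.
LAYOUT = {
    "data": ["raw", "processed"],
    "documentation": ["consent_forms", "metadata"],
    "outputs": ["in_progress", "completed"],
}


def create_project_folders(root: Path, layout=LAYOUT) -> list:
    """Create the folder hierarchy under `root` and return the created paths."""
    created = []
    for top_level, subfolders in layout.items():
        for sub in subfolders:
            folder = root / top_level / sub
            folder.mkdir(parents=True, exist_ok=True)
            created.append(folder)
    return created


# Demonstration in a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    for folder in create_project_folders(Path(tmp) / "my_project"):
        print(folder.relative_to(tmp))
```

The `in_progress` and `completed` folders reflect the advice to separate ongoing and completed work.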

File naming

  • Decide on a file naming convention at the beginning of your research project.
  • Choose names that are meaningful to you and your team.
  • Be consistent. Consider using a standard vocabulary.
  • Think about punctuation and how this will be used in file names.
  • Think about dates and how these will display in file names.
  • Decide on the order of file name elements, so that files on the same topic are listed together and easily found.
  • Decide on the number of digits to be used in numbering when files are listed numerically, for example 01 or 001.
  • Important! Use a revision numbering system for version control. Include naming conventions on draft documents. A version control table is useful for noting changes and dates alongside the document version. Agree on who decides when a document is the final version, and name it as such.
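One way to keep file names consistent is to generate them from the same elements every time. The Python sketch below assumes a convention of project, description, ISO date, three-digit sequence number and two-digit version number; this particular order and the function name `build_file_name` are illustrative choices, not a prescribed standard.

```python
import re
from datetime import date


def build_file_name(project, description, created, sequence, version, extension):
    """Build a name like 'rdm_interview-notes_2024-03-05_007_v02.txt'."""

    def clean(text):
        # Lower-case, and replace anything other than letters and digits with hyphens.
        return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")

    return (
        f"{clean(project)}_{clean(description)}_"
        f"{created.isoformat()}_{sequence:03d}_v{version:02d}.{extension.lstrip('.')}"
    )


print(build_file_name("RDM", "Interview notes", date(2024, 3, 5), 7, 2, ".txt"))
# rdm_interview-notes_2024-03-05_007_v02.txt
```

Writing dates as year-month-day and padding numbers with leading zeros means that files sort in chronological and numerical order when listed alphabetically.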

Documentation and metadata

  • Documentation and metadata (data about your data) are all the information needed to interpret, understand and use data.
  • Start with your documentation and metadata at the beginning of your research project and continue to add information throughout the project.
  • Include procedures for documentation and metadata in your data management plan (DMP).
  • You can use supporting documentation, i.e. separate files that accompany your data, to provide context, explanation, or instructions on confidentiality and data re-use. These documents could include working papers, laboratory books, questionnaires, interview guides, project reports and publications. They should be structured so that the data can be easily identified and found, for example according to a discipline or international standard.
  • You can use embedded documentation, i.e. information included within the data or document itself. Examples include code, field and label descriptions, descriptive headers or summaries, information in a document's properties function. See the list of resources below for more metadata tracking and data standards.
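Supporting documentation can be as simple as a small structured file stored next to the data file. The Python sketch below writes such a file in JSON. The field names are hypothetical and only for illustration; where your discipline has a metadata standard, its fields should be used instead.

```python
import json
import tempfile
from pathlib import Path


def write_metadata(data_file: Path, metadata: dict) -> Path:
    """Write `metadata` as JSON next to `data_file` and return the new path."""
    sidecar = data_file.with_name(data_file.name + ".metadata.json")
    sidecar.write_text(
        json.dumps(metadata, indent=2, ensure_ascii=False), encoding="utf-8"
    )
    return sidecar


# Hypothetical field names and made-up values.
example = {
    "title": "Example survey responses",
    "creator": "A. Researcher",
    "date_collected": "2024-03-05",
    "variables": {"age": "Age in years", "job_title": "Self-reported job title"},
    "sensitivity": "Contains indirect identifiers; restricted access",
}

# Demonstration in a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    data_file = Path(tmp) / "survey.csv"
    data_file.write_text("age,job_title\n34,Lecturer\n", encoding="utf-8")
    sidecar = write_metadata(data_file, example)
    print(sidecar.name)  # survey.csv.metadata.json
```

Recording the sensitivity of the data in the same file keeps that information with the data itself.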

Managing references

  • Since a research project can last for a long time, reference management can help you keep track of your citations as you work (and with reference management software even automatically create your bibliography).
  • At the UFS your librarian can help you with reference management.

Tip: We send and receive a lot of emails every day, making it difficult to track down that important research project related email you urgently need. Organising emails for your research project can help. Archive your old emails. Delete emails that you do not need. Use folders to store messages. Make sure that you use careful version control when sending/receiving attached documents (consider using other, more secure methods to exchange data).

Additional resources:

Sharing and disseminating data

There is an increasing trend to share research data, from funders who want to avoid duplication of effort and make data available for other researchers to discover, examine and build upon, to the support of a culture of openness that deters fraud and encourages interdisciplinary research. Sharing data without clear terms of use can make re-use more difficult, since there are many complexities and ambiguities with, for example, the rights in research databases and their various elements.

Assigning Creative Commons licenses to your research data is one way of giving you control over how your data may be used, from putting your data in the public domain to reserving all rights. Explore other ways to apply licenses and waivers to your research data.

Read more about data citation below.

Preserving data

Research data preservation is the process of maintaining access to the data so that it can be found, understood and used in the future. Preservation goes beyond immediate storage and back-up issues and must be considered from the point of research data creation and throughout the entire life-cycle of the data.

Consider the following:

  • Funders and/or institutions now require preservation of research data for 10 years or longer.
  • Storage media degrade or become obsolete over time, leading to data loss.
  • Files may need software that is no longer available in the future, making the data inaccessible, unreadable or unreliable.

(Check out this list of 'endangered' digital materials by the Digital Preservation Coalition.)

  • Back-ups keep data available in the short term, but do not guarantee longevity. Preservation is the active long-term management of research data.
  • If research data has no supporting documentation, it may become unintelligible.

Data preservation should be considered as early as possible. Good practice is to include preservation requirements, and how they will be met, in the data management plan (DMP). Data preservation aims to keep both content (data) and context (metadata) safe for future re-use, so it is significantly easier if preservation is already considered during the planning phase.

Read more about preservation strategies by the Digital Curation Centre or check out this how-to guide for beginners in digital preservation.

Discover and re-use

Citation has always been a key element of scholarly communication. Citing research data in the same way as publications makes proper attribution and credit possible, enables findings to be reproduced, speeds up research progress, supports collaboration and the re-use of data, and is itself a way of sharing research data.

DataCite recommends the following citation format, but encourages various disciplines to develop citation systems that work well for them:

Creator (PublicationYear). Title. Publisher. Identifier.

The version of the dataset and the resource type could also be included:

Creator (PublicationYear). Title. Version. Publisher. ResourceType. Identifier.

The identifier is usually a Digital Object Identifier (DOI), an alphanumeric string assigned to uniquely identify a digital object. It is tied to the object's metadata description and URL (location).
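The two citation patterns above can be assembled with a short helper. The Python sketch below is an illustration only: the function names are hypothetical, the DOI `10.1234/example` is a made-up placeholder, and displaying the DOI as an `https://doi.org/` link is a choice made for this example.

```python
def doi_to_url(doi):
    """Turn a bare DOI such as '10.1234/example' into an https://doi.org URL."""
    doi = doi.strip()
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):]
    return f"https://doi.org/{doi}"


def format_citation(creator, year, title, publisher, identifier,
                    version=None, resource_type=None):
    """Format a data citation following the patterns quoted above."""
    parts = [f"{creator} ({year})", title]
    if version:
        parts.append(version)
    parts.append(publisher)
    if resource_type:
        parts.append(resource_type)
    parts.append(doi_to_url(identifier))
    return ". ".join(parts) + "."


print(format_citation(
    "Researcher, A.", 2024, "Example survey responses",
    "University of the Free State", "10.1234/example",
    version="Version 2", resource_type="Dataset",
))
# Researcher, A. (2024). Example survey responses. Version 2. University of the Free State. Dataset. https://doi.org/10.1234/example.
```

Leaving out `version` and `resource_type` produces the shorter pattern.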

When choosing a repository for your data, select one that will also provide your dataset with a DOI. The UFS's figshare repository will provide this service.
