Open Science

What is it?

Reproducibility means that research data and code are made available so that others can obtain the same results claimed in the scientific outputs. Closely related is replicability: repeating a scientific methodology to reach similar conclusions. Both concepts are core elements of empirical research.

Improving reproducibility increases the rigour and quality of scientific outputs, and thus trust in science. Reproducibility applies directly to the scientific method, the cornerstone of science, and in particular to the following five steps:

  1. Formulating a hypothesis
  2. Designing a study
  3. Running the study and collecting the data
  4. Analysing the data
  5. Reporting the study

Each of these steps should be reported with clear and open documentation, making the study transparent and reproducible.

Read more about reproducibility and its challenges in the Academy of Medical Sciences' 2015 symposium report on improving research practice.

Planning for reproducibility

  • Create a study plan or protocol
    • Begin documentation at study inception by writing a study plan or protocol that includes your proposed study design and methods. Track changes to your study plan or protocol using version control. Calculate the required power or sample size and report the calculation in your protocol, since underpowered studies are prone to irreproducibility.
  • Choose reproducible tools and materials
    • Whenever possible, choose software and hardware tools where you retain ownership of your research and can migrate your research out of the platform for reuse (see Open research software and Open Source).
  • Set up a reproducible project
    • Centralise and organise your project management using an online platform, a central repository, or a single folder for all research files. Within your centralised project, follow best practice by keeping your data and your code in separate folders. Make your raw data read-only and keep it separate from processed data.
    • When saving and backing up your research files, choose formats and informative file names that allow for reuse. File names should be both machine and human readable. In your analysis and software code, use relative paths. Avoid proprietary file formats and use open file formats.
  • Keep track of things
    • Preregister important study design and analysis information to increase transparency and counter publication bias of negative results.
    • Track changes to your files, especially your analysis code, using version control.
    • Document everything done by hand in a README file. Create a data dictionary (also known as a codebook) to describe important information about your data.
    • Consider using Jupyter Notebooks, for example, or other approaches to literate programming to integrate your code with your narrative and documentation.
  • Share and license your research
    • Avoid supplementary files, decide on an acceptable license, and share your data using a repository.
    • Share your materials so they can be reused. For more information, see Open research data and materials.
    • License your code to inform about how it may be (re)used. Share notebooks and containers.
  • Report your research transparently
    • Report and publish your methods and interventions explicitly, transparently, and fully to allow for replication.
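
The power calculation recommended under "Create a study plan or protocol" can be sketched in a few lines. This is a minimal illustration using the standard normal approximation for a two-sided, two-sample comparison; the function name and defaults are assumptions for the example, and the approximation slightly underestimates the sample size a full t-test calculation would give.

```python
import math
from statistics import NormalDist


def two_sample_n(effect_size: float, alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate per-group sample size for a two-sided two-sample test.

    Normal approximation: n = 2 * ((z_{1-alpha/2} + z_{power}) / d)^2,
    where d is the standardised effect size (Cohen's d).
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the test
    z_power = NormalDist().inv_cdf(power)          # quantile for the desired power
    n = 2 * ((z_alpha + z_power) / effect_size) ** 2
    return math.ceil(n)  # round up: sample sizes are whole participants


# A medium effect (d = 0.5) at the conventional alpha = 0.05 and power = 0.8
print(two_sample_n(0.5))  # → 63 per group
```

Reporting the inputs (effect size, alpha, power) alongside the result in your protocol makes the calculation itself reproducible.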
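
The project-layout advice above (separate data and code folders, read-only raw data, a README for manual steps) can be sketched as a small script. The folder names and function are illustrative assumptions, not a prescribed standard.

```python
import stat
from pathlib import Path


def init_project(root: str) -> Path:
    """Create a minimal reproducible-project skeleton (names are illustrative)."""
    base = Path(root)
    # Keep data and code in separate folders; use relative paths within the project
    for sub in ("data/raw", "data/processed", "code", "docs"):
        (base / sub).mkdir(parents=True, exist_ok=True)
    # A README documents everything done by hand
    (base / "README.md").write_text("# Project\n\nDocument manual steps here.\n")
    # Make the raw-data folder read-only so raw inputs are never edited in place
    (base / "data/raw").chmod(stat.S_IRUSR | stat.S_IXUSR)
    return base
```

Because every path is relative to the project root, the whole folder can be moved, archived, or shared without breaking the analysis code.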

Questions, obstacles and common misconceptions

Everything is in the paper; anyone can reproduce this from there!

This is one of the most common misconceptions. Even an extremely detailed description of the methods and workflows used to reach the final result will, in most cases, not be sufficient to reproduce it. Several factors can prevent this, including differences in computational environments, differences in software versions, and implicit assumptions that were not clearly stated.
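One simple countermeasure to the environment problem described above is to record a snapshot of the computational environment alongside the results. The sketch below captures only interpreter and OS details; a real project would also pin package versions, for example with a lock file or a container image. The function name is an assumption for the example.

```python
import json
import platform


def environment_snapshot() -> dict:
    """Record basic interpreter and OS details so others can match the
    computational environment (a minimal sketch, not a full lock file)."""
    return {
        "python": platform.python_version(),
        "implementation": platform.python_implementation(),
        "os": platform.system(),
        "machine": platform.machine(),
    }


# Save this next to your outputs, e.g. as environment.json
print(json.dumps(environment_snapshot(), indent=2))
```

Committing such a snapshot with each set of results lets a reader immediately see whether a failure to reproduce might stem from a different environment rather than from the analysis itself.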

I don't have the time to learn and establish a reproducible workflow.

Many freely available online services can be combined to set up an entire reproducible workflow. Moreover, the time and effort invested up front pays off twice: it increases the scientific validity of the final results, and it minimises the time needed to re-run the analysis or extend it in further studies.