Open, reproducible science

using Make, RMarkdown and Pandoc

Shaun Jackman @sjackman

2014-10-09 at VanBUG, Vancouver, Canada

Creative Commons Attribution License

Shaun Jackman

Genome Sciences Centre, BC Cancer Agency
Vancouver, Canada
@sjackman
github.com/sjackman
sjackman.ca

Open and reproducible science

  • Open science
  • Repeatable science
    • by you
    • by others
  • Reproducible science

Open science

  • Publish all research outputs
  • Archive manuscripts
  • Publish papers in open-access journals
  • Sign peer reviews
  • Participate in public discussion, like Twitter

Reproducible science

Repeatable science

Given the same data and code…

  • Reproduce the same results
  • At least by yourself; this should be the minimum bar
  • Hopefully repeatable by others as well

Reproducible science

Given the manuscript…

Another scientist can

  • Repeat the experiment
  • Analyse the data
  • Draw the same conclusion

Repeatable vs. reproducible science

Reproducibility is fundamental to science

  • Often we don't even accomplish repeatable science
  • So let's start there

Repeatable science

Managing software

We used Linuxbrew to install the required software from Homebrew-science version 2014-08.

Homebrew navigates dependency hell

Dependencies of bioinformatics tools in Homebrew

Publish data

Best way to set back your competitors is to release your #data. That way they have to analyze their data & your data

C. Titus Brown @ctitusbrown
BOSC 2014 keynote
A History of Bioinformatics (in the Year 2039)

Version control

  • git/GitHub for (almost) everything!
  • Maybe not big, raw data
  • For experimental design data
  • For results and summary statistics
  • Data in a plain-text format, like TSV
  • GitHub renders TSV pretty!
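
As a minimal sketch of the plain-text approach above, this Python snippet writes summary statistics as TSV using only the standard library. The column names and values are hypothetical placeholders, not results from any real analysis.

```python
import csv
import io

# Hypothetical summary statistics; the column names and values are placeholders.
rows = [
    {"sample": "A", "reads": 1000, "mapped": 950},
    {"sample": "B", "reads": 2000, "mapped": 1800},
]


def to_tsv(records):
    """Serialize a list of dicts as tab-separated text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=list(records[0]),
        delimiter="\t",
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


if __name__ == "__main__":
    # Redirect the output to a file and commit it alongside the code.
    print(to_tsv(rows), end="")
```

One row per line and one tab per column keeps the diffs in version control small and readable.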

A reproducible manuscript

One Makefile

  • Downloads the data
  • Runs the command-line programs
  • Performs the statistical analyses using R
  • Generates the TSV tables
  • Renders the figures using ggplot2
  • Renders the supplementary material using RMarkdown
  • Renders the manuscript using Pandoc
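
The steps above form a dependency graph, and Make runs them in an order that respects it. As a rough illustration only, here is that idea in Python using the standard library's `graphlib` (Python 3.9 or later). The file names are hypothetical, not taken from the actual manuscript repository, and a real Makefile would also skip targets that are already up to date.

```python
from graphlib import TopologicalSorter

# Hypothetical targets mapped to the files they depend on.
# These names are illustrative; they are not from the actual repository.
dependencies = {
    "data.fa": [],                       # downloaded data
    "results.tsv": ["data.fa"],          # command-line programs and R analysis
    "figure.png": ["results.tsv"],       # figure rendered with ggplot2
    "supplement.html": ["results.tsv"],  # supplementary material from RMarkdown
    "manuscript.pdf": ["figure.png", "results.tsv"],  # manuscript from Pandoc
}


def build_order(deps):
    """Return the targets in an order where every prerequisite comes first."""
    return list(TopologicalSorter(deps).static_order())


if __name__ == "__main__":
    for target in build_order(dependencies):
        print(target)
```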

Turns this

UniqTag Markdown

Into this

UniqTag PDF

Workflow

Plain Text, Papers, Pandoc by Kieran Healy

I promise this is less insane than it appears

Make is beautiful

Tell Make how to create one type of file from another
and which files you want to create.

Make looks at which files you have
and figures out how to create the files that you want.

Make example

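# Convert SAM to BAM
%.bam: %.sam
	samtools view -Sb $< >$@

# Sort a BAM file; the second argument is the output prefix
%.sort.bam: %.bam
	samtools sort $< $*.sort

# Index a BAM file
%.bam.bai: %.bam
	samtools index $<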

$ touch hello.sam
$ make hello.sort.bam.bai
samtools view -Sb hello.sam >hello.bam
samtools sort hello.bam hello.sort
samtools index hello.sort.bam
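
To make the chaining above concrete, here is a deliberately simplified Python sketch of how the three pattern rules resolve `hello.sort.bam.bai`. It is not how GNU Make is implemented: the `RULES` table and the `plan` function are illustrative names, the sketch ignores timestamps, and it tries the rules in the order listed, which is enough for this example.

```python
# Each rule: (target suffix, prerequisite suffix, command template).
# In the templates, {src} plays the role of $<, {dst} of $@ and {stem} of $*.
RULES = [
    (".sort.bam", ".bam", "samtools sort {src} {stem}.sort"),
    (".bam.bai", ".bam", "samtools index {src}"),
    (".bam", ".sam", "samtools view -Sb {src} >{dst}"),
]


def plan(target, existing):
    """Return the commands needed to build target, or None if no rule chain works."""
    if target in existing:
        return []  # the file is already there, so there is nothing to do
    for target_suffix, prereq_suffix, template in RULES:
        if not target.endswith(target_suffix):
            continue
        stem = target[: -len(target_suffix)]
        if not stem:
            continue  # a pattern's % must match at least one character
        prereq = stem + prereq_suffix
        commands = plan(prereq, existing)
        if commands is not None:
            return commands + [template.format(src=prereq, dst=target, stem=stem)]
    return None


if __name__ == "__main__":
    for command in plan("hello.sort.bam.bai", {"hello.sam"}):
        print(command)
```

Starting from only `hello.sam`, the sketch works backwards from the requested file to a file that exists, then emits the commands in build order.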

Markdown for the manuscript

Markdown is a plain-text typesetting language

A header
========

A list:

+ This text is *italic*
+ This text is **bold**

A header

A list:

  • This text is italic
  • This text is bold

RMarkdown

  • RMarkdown interleaves text with code in R
  • Code that calculates summary statistics
  • Code that generates tables
  • Code that renders figures using ggplot2
  • RMarkdown is ideal for supplementary material

Pandoc

Pandoc renders attractive documents and slides
from plain-text typesetting formats

It converts between (just about) every known format

  • Markdown
  • HTML
  • LaTeX
  • PDF (output only)
  • ODT and docx (yes, really)
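
As a small sketch of driving Pandoc from a script, the helper below builds a `pandoc` command line, and `convert` runs it only if `pandoc` is installed. The file names `manuscript.md`, `manuscript.html` and `manuscript.docx` are hypothetical. The `-o` option names the output file, and Pandoc infers the formats from the file extensions.

```python
import shutil
import subprocess


def pandoc_command(source, output):
    """Build a pandoc command; formats are inferred from the file extensions."""
    return ["pandoc", source, "-o", output]


def convert(source, output):
    """Run pandoc if it is installed; return True when a conversion was run."""
    if shutil.which("pandoc") is None:
        return False  # pandoc is not on the PATH, so nothing was converted
    subprocess.run(pandoc_command(source, output), check=True)
    return True


if __name__ == "__main__":
    # Hypothetical file names, for illustration only; this just prints the commands.
    for output in ["manuscript.html", "manuscript.docx"]:
        print(" ".join(pandoc_command("manuscript.md", output)))
```

By default, PDF output also requires a LaTeX installation.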

fin
