class: top, left, title-slide # Opinionated practices for teaching reproducibility: ## motivation, guided instruction
and practice ###
@TiffanyTimbers
(UBC) ### 2021-10-29 ###
bit.ly/opin-teach-repro
--- class: center, middle # A bit about me and what I teach at UBC <img src="img/tiff.jpg" height=300> --- ### Teaching experience - Software & Data Carpentry Instructor - Rstudio {tidyverse} Instructor - Assistant Professor of Teaching, Dept. of Statistics, UBC - Co-Director of the UBC Master of Data Science program, UBC .pull-left[ <img src="img/logos.png" height=350> ] .pull-right[ #### Graduate courses: - [DSCI 523: Programming for data manipulation](https://github.com/UBC-MDS/DSCI_523_r-prog) - [DSCI 552: Statistical inference and computation I](https://github.com/UBC-MDS/DSCI_552_stat-inf-1) - [DSCI 522: Data science workflows](https://github.com/UBC-MDS/DSCI_522_dsci-workflows) - [DSCI 524: Collaborative software development](https://github.com/UBC-MDS/DSCI_524_collab-sw-dev) #### Underaduate courses: - [DSCI 100: Introduction to data science](https://ubc-dsci.github.io/dsci-100/README.html) - DSCI 310: Reproducible and trustworthy workflows for data science (new course!) ] --- class: inverse, center, middle # Data Science: ### *the study, development and practice* ### *of reproducible and auditable processes* ### *to obtain insight from data* --- class: middle ## Reproducible analysis: #### *reaching the same result given the same input, computational methods and conditions*<sup>1</sup>. ## Transparent analysis, #### *a readable record of the steps used to carry out the analysis as well as a record of how the analysis methods evolved*.<sup>2</sup> .footnote[ [1] National Academies of Sciences, 2019 [2] Parker, 2017 and Ram, 2013 ] --- class: middle ## Why do we adopt this definition? We believe that data science work should both bring insight and employ reproducible and auditable methods so that trustworthy results and data products can be created. Data products can be built via other methods, but we lack confidence in how the results or products were created. We believe this stems from non-reproducible and non-auditable analyses: 1. lacking evidence that the results or product could be regenerated given the same input computational methods, and conditions 2. lacking evidence of the steps taken during creation 3. having an incomplete record of how and why analysis decisions were made --- class: middle ## Examples of things that can go wrong without reproducible practices - An interesting result that you cannot recreate đ - Your email inbox is full of information related to the project that only you have access too đ« - A small change to the analysis code requires re-running the entire thing, *and takes hours...* đ§ - Activation time to becoming productive after taking a break from the project is hours to days đŽ - Code that can only be run on one machine, *and you don't know why...* đ” --- class: inverse, center, middle ## *So if reproducibility is so important for data science, why is it hard to teach it?* --- class: middle ## Possibilities: 1. Lack of awareness of the problem? 2. Reproducibility is an inconvenient means to end, rather than an important skill? 3. Reproducibility does not directly lead to novel insights? --- class: middle ## Key things for teaching reproducibility ### 1. placing extra emphasis on motivation ### 2. guided instruction ### 3. lots of practice!!! --- class: inverse, center, middle # Why do we need "extra" motivation? --- class: center ## Git is hard! <img src="https://imgs.xkcd.com/comics/git_2x.png" height=475> *Source: https://xkcd.com/1597/* --- ## Git is hard! Command to view the state of the repository 2 commits ago: ```bash git checkout HEAD~2 ``` Output đ€Ż: ```bash Note: switching to 'HEAD~2'. You are in 'detached HEAD' state. You can look around, make experimental changes and commit them, and you can discard any commits you make in this state without impacting any branches by switching back to a branch. If you want to create a new branch to retain commits you create, you may do so (now or later) by using -c with the switch command. Example: git switch -c <new-branch-name> Or undo this operation with: git switch - Turn off this advice by setting config variable advice.detachedHead to false HEAD is now at cea9a00 updated site ``` --- ## Git is hard! <img src="img/jenny-bryan-burn-it-all-down.png" width=700> --- class: center ## LaTeX errors are often cryptic to new learners <img src="img/rmd_cat.jpeg" height=450> *Source: https://twitter.com/overflow_meme/status/1232185254280138753* --- class: center ## LaTeX errors are often cryptic to new learners <img src="img/r-latex-error.png" height=500> --- class: center ## Creating environments is painstakingly slow <img src="https://www.meme-arsenal.com/memes/f8e3714ed0d4a19b65ebe61006ac9bde.jpg" height=450> *Source: https://www.meme-arsenal.com/en/create/meme/1759571* --- ## Creating environments is painstakingly slow Dockerfile that adds 3 packages to a [rocker](https://www.rocker-project.org/) Docker image: ```Docker FROM rocker/verse:4.1.1 # install R packages RUN apt-get update -qq && install2.r --error \ --deps TRUE \ plotly \ GGally \ cowplot ``` Build the Dockerfile: ```bash time docker build . [+] Building 198.4s (6/6) FINISHED ... ``` That took about 5 minutes to build on my Macbook pro!!! --- class: middle ## Examples of how we motivate learning ### 1. Tell stories from the trenches ### 2. Study cases of failures with real world consequences ### 3. Let them fail (in a controlled manner) --- ### Tell stories from the trenches -- *As a Masters student, I started to use R to do my statistical analysis. I obtained the results I needed from running my code in the R console and copying the results into the word document that was my manuscript. Six months later we were working on revisions requested by the reviewers and I could not remember which version of the code I ran to get my results. I eventually figured it out through much trial and error, but the process was inefficient and very stressful.* -- *As a Masters student, I spent many hours customizing the colors of a figure that had many points and lines on it using Adobe Illustrator. The next week in a meeting with my supervisor, she very correctly pointed out that the color scheme I chose would be problematic for color-blind people. I then had to spend many more hours repeating the same work to change the colors of all the lines and points to fix it. It was very frustrating to repeat such tedious and time consuming work, and I was not able to meet other deadlines on time because I had to redo this.* -- *I have one project from my PhD where I am still supposed to help out finish some things off but dread doing so because it is unorganized and is not tracking the changes in the output with the changes in the code. So although we are using version control there, the input and output are gigabytes of images and we didn't set up any good way to keep track of which images were analyzed with which version of the code. We also "didn't have time" to write tests for the code, so there is no guarantee there are not unintended side effects from the changes made even if they are tracked in version control.* --- class: middle ### Study cases of failures with real world consequences | Reproducibility error | Consequence | Source(s) | |----------------------------------------------------------|------------------------------------------------------|------------------------------------------| | Limitations in Excel data formats | Loss of 16,000 COVID case records in the UK | Kelion, 2020 | | Automatic formatting in Excel | Important genes disregarded in scientific studies | Zeeberg et al., 2004; Ziemann et al., 2016 | | Deletion of a cell caused rows to shift | Mix-up of which patient group received the treatment | Wallensteen et al., 2018 | | Using binary instead of explanatory labels | Mix-up of the intervention with the control group | Aboumatar et al., 2019 | | Using the same notation for missing data and zero values | Paper retraction | Whitehouse et al., 2021 | | Incorrectly copying data in a spreadsheet | Delay in the opening of a hospital | Picken, 2020 | --- class: middle ### Example: Let them fail (in a controlled manner) **Course:** DSCI 522 (workflows for data science) .pull-left[ **Topic:** Containerization via Docker ] <center> <img src="https://www.docker.com/sites/default/files/d8/2019-07/vertical-logo-monochromatic.png" width=100> </center> Give students a data analysis pipeline project on GitHub, and ask them to run it. They have done things like this before, but this time, we add several packages that they haven't used before. Let them work to install the packages and get the analysis up and running. Give them a second data analysis pipeline project on GitHub, and ask them to run it. This also has several packages they haven't used before, but this time there are instructions for running it in a Docker container. --- class: inverse, center, middle # Why guided instruction? --- class: center, middle ### *From our experience, reproducibility is not something that most people or students figure out on their own, or if they do, it is not an efficient process...* --- class: middle ### Excerpt from Roger Peng's blog post on <br> ["The Role of Theory in Data Analysis"](https://simplystatistics.org/posts/2018-12-11-the-role-of-theory-in-data-analysis/): #### *There is no need for a new data analyst to learn about reproducibility âfrom experienceâ. We donât need to lead a junior data analyst down a months-long winding path of non-reproducible analyses until they are finally bitten by non-reproducibility (and therefore âlearn their lessonâ). We can just tell them* #### *"In the past, weâve found it useful to make our data analyses reproducible. Hereâs a workflow to guide you in your own analyses."* #### *With that one statement, we can âcompressâ over 20 years of experience.* --- class: middle ## Examples of how we use guided instruction ### 1. Pre-lecture activities ### 2. Live demonstration ### 3. Worksheets --- class: inverse, center, middle ### However, there be dragons... --- class: center, middle <img src="img/master_to_main.png" width=750> ### *...fast moving field means constant curation* *Source: https://github.com/ <br>& http://www.piwai.info/blog-meets-monster-octocat* --- ### Example: guided instruction **Course:** DSCI 100 (Introduction to data science) .pull-left[ **Topic:** Version control ] <img src="https://miro.medium.com/max/1400/1*n5MnDrinAXeY2NpCX9H8cw.jpeg" width=300> Students read about version control before attending lectuer. In lecture there is a live demonstration 0f how to use the GitHub website to create a repository, and a Git GUI to clone the repository as well as push code to the repository on GitHub. Students then work through a worksheet, that guides them to do the same thing, which asks them questions along the way about what they are doing at each stage to reinforce the conceptual aspects of what they are doing. **Artifacts for this example:** - [Accompanying textbook reading](https://ubc-dsci.github.io/introduction-to-datascience/Getting-started-with-version-control.html) - [Live demonstration](https://youtu.be/attPo4zEElU) - [Worksheet to guide students](https://github.com/UBC-DSCI/dsci-100-assets/blob/master/2021-spring/materials/worksheet_05/worksheet_05.ipynb) --- ## Pre-lecture activity <img src="img/textbook-reading.png" width=600> [Link to pre-reading](https://ubc-dsci.github.io/introduction-to-datascience/Getting-started-with-version-control.html) --- ## Live demonstration <img src="img/live-demo.png" width=825> [Link to live demo recording](https://youtu.be/attPo4zEElU) --- ## Worksheet to guide students <img src="img/worksheet.png" width=775> [Link to worksheet](https://github.com/UBC-DSCI/dsci-100-assets/blob/master/2021-spring/materials/worksheet_05/worksheet_05.ipynb) --- class: inverse, center, middle # Why lots of practice? --- class: middle ## **Mastery of a subject often involves consolidating ideas, concepts and theories into long-term memory.** ## **And consolidating most things into long-term memory requires repetition**<sup>1</sup>. .footnote[ [1] Ebbinghaus, Hermann (1913). âMemory (ha ruger & ce bussenius, trans.)â In: New York: Teachers College.(Original work published 1885) 39. ] --- class: center, middle ### However, when we teach reproducibility workflows and skills, we want students to do more than learn them, we want them to use and practice them. ### We want to actually change their habits or behaviours. --- class: inverse, center, middle # Habit formation ### *the triggering of behavior from contextual cues*<sup>1</sup>. <!--In the context of reproducible workflows these cues are the tasks that students desire to execute, such as saving a file after adding new content or wanting to share a document with a colleague.--> .footnote[ [1] Gardner, Benjamin and Amanda L. Rebar (Apr. 2019). âHabit Formation and Behavior Changeâ. en. In: Oxford Research Encyclopedia of Psychology. Oxford University Press. isbn: 978-0-19-023655-7. doi: 10.1093/acrefore/9780190236557.013.129. ] --- ## How does habit formation help? Habits protect individuals from motivational lapses, where a desired good behavior is not expressed due to a momentary lack of willpower<sup>1</sup>. .pull-left[ <img src="https://media.giphy.com/media/3240pzelRQkjj9XaiK/source.gif"> *Source: https://giphy.com/gifs/reaction-3240pzelRQkjj9XaiK* ] .pull-right[ - Non-reproducible workflows tend to be "easier" and life gets "busy"... - Without reproducibility habits, its a perfect storm to do the non-reproducible alternative **(even when you know it's better!)** - We want to get students to a point where they opt for the reproducible workflow âby defaultâ. ] .footnote[ [1] Gardner, Benjamin and Amanda L. Rebar (Apr. 2019). âHabit Formation and Behavior Changeâ. en. In: Oxford Research Encyclopedia of Psychology. Oxford University Press. isbn: 978-0-19-023655-7. doi: 10.1093/acrefore/9780190236557.013.129. ] --- ## How are habits formed? - best learned through frequent, regular, and sustained cue exposure - some studies have reported that the median is at least around 70 days before reaching the plateau phase of habit development<sup>1,2</sup>. - ### habits == LOTS OF PRACTICE! .footnote[ [1] Lally, Phillippa et al. (2010). âHow Are Habits Formed: Modelling Habit Formation in the Real Worldâ. en. In: European Journal of Social Psychology 40.6, pp. 998â1009. issn: 1099-0992. doi: 10.1002/ejsp.674. [2] Fournier, Marion et al. (Nov. 2017). âEffects of Circadian Cortisol on the Development of a Health Habitâ. eng. In: Health Psychology: Official Journal of the Division of Health Psychology, American Psychological Association 36.11, pp. 1059â1064. issn: 1930-7810. doi: 10.1037/hea0000510. ] --- class: middle ## Examples of how we embed lots of practice in the data science classroom ### 1. Student practice following live demos ### 2. Lots of low stakes assessments with small/short problems ### 3. Learning technologies & platforms use authentic data science reproducibility tools --- class: middle ### Example: lots of practice!!! **Course:** Almost all Master of Data Science courses (~ 20 courses) .pull-left[ **Topic:** Version control ] <img src="https://miro.medium.com/max/1400/1*n5MnDrinAXeY2NpCX9H8cw.jpeg" width=300> We use GitHub as our course management system. Homework instructions and assignments are distributed to students as GitHub repositories. They have to submit their assignment by pushing their work to the GitHub repositories. We also require that they have 3 commits associated with each homework submission. By the end of the program, they have version controlled their work in > 96 different repositories!!! **Tools for using GitHub as a learning management system:** - [GitHub Classroom](https://classroom.github.com/) - [Classy](https://github.com/ubccpsc/classy) - [rhomboid](https://github.com/mgelbart/rhomboid) --- ### Example: lots of practice!!! Home table of a UBC Master of Data Science student's GitHub homework repositories: .pull-left[ <img src="img/home-table1.png" width=475> ] .pull-right[ <img src="img/home-table2.png" width=450> ] --- ## Staging might be useful for practicing more advanced reproducibility workflows Some reproducibility workflows require some time consuming, or complex prerequisite work to have been done before it makes sense to do them. In such cases. "pre-baking" exercises to a particular point is helpful for providing practice opportunities. .pull-left[ <img src="img/martha-stewart.png" width=500> Source: https://peopletv.com/video/martha-bakes-angel-food-cake/ ] .pull-right[ Examples include: - automating the running of scripts via a data analysis pipeline (e.g., Makefile) - using `bookdown` or Jupyter book to automate figure and table numbering - organizing GitHub issues into milestones and a project board ([example exercise](https://github.com/ttimbers/mds-homework)) - performing a code review on a pull requests ([example exercise](https://github.com/ttimbers/review-my-pull-request)) ] --- class: inverse, center, middle # Wrap up --- class: middle ## Key things for teaching reproducibility ### 1. Extra emphasis on motivation ### 2. Guided instruction ### 3. Lots of practice!!! --- class: center, middle <img src="https://media.giphy.com/media/x1yhRwlqxBiQU/source.gif" width=800> *Source: https://giphy.com/gifs/opinion-the-big-lebowski-x1yhRwlqxBiQU* --- class: middle ## Anectodal evidence - reduced technical debt in UBC Master of Data Science Capstone projects as we have gained more experience teaching reproducibility ## Feedback from Alumni - *"I had my [COMPANY] Git training today for the data engineers/scientists and I wanted to tell you that everything I learned in your class about git was extremely helpful. You did such an amazing job teaching everything we need to know about git and working in technology, (I personally think I learned much more in your class than in my training here in SF)."* - *"...the use of Git and Docker and the strong focus on reproducibility are of the best technical skills our program offer."* --- ### References Aboumatar, Hanan and Robert A. Wise (Oct. 2019). âNotice of Retraction. Aboumatar et al. Effect of a Program Combining Transitional Care and Long-Term Self-Management Support on Outcomes of Hospitalized Patients With Chronic Obstructive Pulmonary Disease: A Randomized Clinical Trial. JAMA. 2018;320(22):2335-2343.â In: JAMA 322.14, pp. 1417â1418. issn: 0098-7484. doi: 10.1001/jama.2019.11954. Fournier, Marion et al. (Nov. 2017). âEffects of Circadian Cortisol on the Development of a Health Habitâ. eng. In: Health Psychology: Official Journal of the Division of Health Psychology, American Psychological Association 36.11, pp. 1059â1064. issn: 1930-7810. doi: 10.1037/hea0000510. Gardner, Benjamin and Amanda L. Rebar (Apr. 2019). âHabit Formation and Behavior Changeâ. en. In: Oxford Research Encyclopedia of Psychology. Oxford University Press. isbn: 978-0-19-023655-7. doi: 10.1093/acrefore/9780190236557.013.129. Kelion, Leo (Oct. 2020). âExcel: Why Using Microsoftâs Tool Caused Covid19 Results to Be Lostâ. en-GB. In: BBC News. Lally, Phillippa et al. (2010). âHow Are Habits Formed: Modelling Habit Formation in the Real Worldâ. en. In: European Journal of Social Psychology 40.6, pp. 998â1009. issn: 1099-0992. doi: 10.1002/ejsp.674. National Academies of Sciences, Engineering, and Medicine and others (2019). Reproducibility and replicability in science. National Academies Press. --- ### References (continued) Parker, Hilary (2017). âOpinionated analysis developmentâ. In: PeerJ Preprints 5, e3210v1. Ram, Karthik (2013). âGit can facilitate greater reproducibility and increased transparency in scienceâ. In: Source code for biology and medicine 8.1, pp. 1â8. Wallensteen, Lena et al. (2018). âRetraction notice to" Evaluation of behavioral problems after prenatal dexamethasone treatment in Swedish adolescents at risk of CAH"[Hormones and Behavior 85C (2016) 5-11]â. In: Hormones and behavior 103, p. 140. Whitehouse, et al. (July 2021). âRetraction Note: Complex Societies Precede Moralizing Gods throughout World Historyâ. en. In: Nature 595.7866, pp. 320â320. issn: 1476-4687. doi: 10.1038/s41586-021-03656-3. Zeeberg, Barry R et al. (2004). âMistaken identifiers: gene name errors can be introduced inadvertently when using Excel in bioinformaticsâ. In: BMC bioinformatics 5.1, pp. 1â6. Ziemann, Mark, Yotam Eren, and Assam El-Osta (2016). âGene name errors are widespread in the scientific literatureâ. In: Genome biology 17.1, pp. 1â 3. ### Credits [Title slide illustration](https://www.freepik.com/free-vector/couple-professionals-analyzing-graphs_6974868.htm) was created by pch.vector - www.freepik.com. --- ### Acknowledgements .pull-left[ UBC Master of Data Science teaching team (past and present) - Tomas Beuzen - Vincenzo Coia - Giulio Valentino Dalla Riva - Florencia D'Andrea - Mike Gelbart - Gittu George - Varada Kolhatkar - Rodolfo Lourenzutti - Firas Moosvi - Quan Nguyen - **Joel Ostblom** (co-author on preprint of this talk) - Alexi RodrĂguez-Arelis - Arman Seyed-Ahmadi DSCI 100 teaching team - Trevor Campbell - Melissa Lee ] .pull-right[ 2021-22 Master of Data Science teaching team <img src="https://ubc-mds.github.io/img/team/group.jpeg"> ] --- class: inverse, center, middle # Questions? ### @TiffanyTimbers #### talk slides: *[bit.ly/opin-teach-repro](https://bit.ly/opin-teach-repro)* #### preprint: https://arxiv.org/abs/2109.13656