Web site last updated 07 March 2024. An extended book version of these pages is available here.
Introduction
These web pages complement the data science chapter in the 4th Edition of BCT’s Bat Survey Guidelines, and hopefully encourage ecologists to make the most of their bat survey data. They also demonstrate literate programming with Quarto®1 and R Markdown2 which can vastly improve workflow3 (welcome to the world beyond Excel).
The term data science is used because it covers data collection, management, processing, analysis, visualisation, interpretation, reporting and reproducibility. Statisticians would say this is what they have always done in statistics! There is no doubt that data science is growing: most major universities now offer a degree course in the subject and, together with the increasing power of computer algorithms, this makes data science more than just a rebranding of statistics (Donoho 2017).
The data science is applied through literate programming, outlined in Figure 1. This enables efficient reporting of bat data4, from a simple table such as a count of bats to the output and interpretation of machine learning in a fully formatted report, plus everything in between, all accomplished through open source R5 (R Core Team 2023) and RStudio (Posit team 2022). The beauty of literate programming is reproducibility, an essential tenet of all scientific study; in the commercial and legal world it makes for defensible reporting. The Reporting page has literate programming examples for a Word report and a PowerPoint presentation.
Figure 1: Literate Programming
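As a minimal sketch of the simple end of that range, the chunk below counts bat passes per species and prints the result as a table. The `bat_passes` data frame is hypothetical and invented purely for illustration; it is not taken from the iBats package or from any survey.

```r
library(dplyr)

# Hypothetical bat passes, invented for illustration only (not survey data)
bat_passes <- data.frame(
  Species = c(
    "Pipistrellus pipistrellus", "Myotis nattereri",
    "Pipistrellus pipistrellus", "Nyctalus noctula",
    "Pipistrellus pipistrellus", "Myotis nattereri"
  )
)

# Count the passes per species, most frequent first
species_counts <- bat_passes %>%
  count(Species, name = "Passes", sort = TRUE)

# In a Quarto or R Markdown chunk, kable() renders this as a formatted table
knitr::kable(species_counts, caption = "Bat passes by species (illustrative data)")
```

Because the table is generated by the code each time the document is rendered, a change to the data flows through to the report without any manual copying.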
Much is said about the digital skills gap6; in a small way, these data science pages aim to improve digital skills by demonstrating modern data science methods7. For a balanced understanding of the link between digital skills and data science, see the Royal Statistical Society article.
You may ask what is wrong with the spreadsheet for data science. On a practical level, spreadsheets are hard to maintain, errors in them are hard to find (or even to notice in the first place), they are poor at handling dates8, and they are difficult to share with others. For spreadsheet blunders, listen to Tim Harford’s More or Less on BBC Sounds9; for a litany of mathematical mistakes, many involving spreadsheets, see Matt Parker’s book Humble Pi: A Comedy of Maths Errors (Parker 2019). On a positive note, spreadsheets are handy and easy to use for a few lines of data.
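By way of contrast on the date-handling point, the short sketch below shows base R leaving an ambiguous entry as text until it is explicitly parsed with a stated format. The values are made up for illustration.

```r
# Entries like these stay as plain text in R; nothing is silently converted
entries <- c("1/1", "1-1")
class(entries)

# A date is only created when asked for, with the day/month/year order stated
survey_date <- as.Date("07/03/2024", format = "%d/%m/%Y")
class(survey_date)

# ISO 8601 output removes any day/month ambiguity
format(survey_date, "%Y-%m-%d")
```

Stating the format in the code also documents how the date was interpreted, which helps when the analysis is reviewed later.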
Getting Started
To help ecologists on their data science journey, all the code making the graphs and analysis in these web pages is free to copy and use; just click on Show the code, copy to the clipboard 10, paste into the R environment and run. If new to R and RStudio see Section.
An example of Show the code is given below; the code produces Figure 2. The code copied to the clipboard is designed to run as a standalone chunk (or R script)11: it loads the required R libraries and data.
Code
    ### Libraries Used
    library(tidyverse)  # Data Science packages - see https://www.tidyverse.org/
    library(treemapify) # extension to ggplot for plotting treemaps -
    # see https://cran.r-project.org/web/packages/treemapify/vignettes/introduction-to-treemapify.html
    library(ggthemes)   # for colour palette "Tableau 10"

    # Install devtools if not installed
    # devtools is used to install the iBats package from GitHub
    if (!require(devtools)) {
      install.packages("devtools")
    }

    # If iBats is not installed, install it from GitHub
    if (!require(iBats)) {
      devtools::install_github("Nattereri/iBats")
    }
    library(iBats)

    #### Add date and time information to the iBats statics bat survey data set
    #### using iBats::date_time_info
    statics_plus <- iBats::date_time_info(statics)

    # Count the observations for each species in each month
    graph_data <- statics_plus %>%
      group_by(Species, Month) %>%
      tally()

    # Treemap: tile area is the count, tiles are grouped and coloured by month
    ggplot(graph_data, aes(area = n, fill = Month, label = Species, subgroup = Month)) +
      scale_fill_tableau(palette = "Tableau 10") +
      geom_treemap(colour = "white", size = 2, alpha = 0.9) +
      geom_treemap_subgroup_border(colour = "black", size = 5, alpha = 0.9) +
      geom_treemap_subgroup_text(
        place = "centre", grow = TRUE, alpha = 0.9,
        colour = "grey20", min.size = 0
      ) +
      geom_treemap_text(
        colour = "grey90", place = "topleft", fontface = "italic",
        reflow = TRUE, min.size = 0, alpha = 0.9
      ) +
      theme_bw() +
      theme(legend.position = "none") # No legend
Figure 2: Example Graph: Monthly Bat Activity from the statics data set in the iBats Package
Coding Tip
Rather than write code from scratch, adapt working code to your own purposes.
Literate programming facilitates the use of coding languages other than R, such as Python12 and Julia13. Computer languages can be mixed in the same literate programming document; for example, one chunk of R code can do the data manipulation while another chunk of Python code performs the machine learning. Coding languages applied to data science are developing rapidly in terms of their ability, speed of execution and user friendliness14; literate programming provides the framework for ecologists to keep their data science skills moving forward.
Evidence Led Reporting
Literate programming assists data science and reproducibility, promoting evidence led reporting and decision making. Reports are often produced for regulatory bodies, central government or local authorities; these organisations have mandatory strategies for the use of science, evidence and evaluation in their advice and actions, and for the legality of their decisions.
Install R, RStudio and Packages
Download and install the latest version of R from https://cran.r-project.org/bin/windows/base/ (this link is the Windows download page). Download the version for your operating system; R can be downloaded for Windows, Mac and Linux.
It is recommended that R is used through the RStudio IDE. Download and install the latest version of RStudio from their web page https://www.rstudio.com/products/rstudio/#Desktop. Download the free desktop version.
Install the iBats Package from GitHub
The iBats package contains example data and functions that help with the data science of bat survey results. To install this package, use the code below in the RStudio Console, one line at a time. The package is installed from GitHub.
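A minimal sketch of the installation, mirroring the install steps in the Show the code example under Getting Started (the devtools package and the `Nattereri/iBats` repository are the ones named there). It assumes an internet connection and a working R installation.

```r
# devtools provides install_github(); install it first if it is missing
install.packages("devtools")

# Install the iBats package from its GitHub repository
devtools::install_github("Nattereri/iBats")

# Load the package to confirm that the installation worked
library(iBats)
```

If `library(iBats)` runs without an error, the example data (such as `statics`) and the package functions are ready to use.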
Free and Open Source Software (FOSS) constitutes 70-90% of any modern software solution15. R and RStudio are open source software that have made data science more open, intuitive, accessible, and collaborative. As a Public Good16 the value of FOSS is yet to be fully recognised. FOSS is provided by a large community, without whom these web pages would not be written; some of this community are acknowledged as individuals in the references section of the Resources page.
References
Parker, Matt. 2019. Humble Pi: A Comedy of Maths Errors. Penguin.
Posit team. 2022. RStudio: Integrated Development Environment for R. Boston, MA: Posit Software, PBC. http://www.posit.co/.
R Core Team. 2023. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.
Footnotes
1. Quarto® is an open-source scientific and technical publishing system; see https://quarto.org/.
6. The data skills gap is relevant to professional bodies such as the Chartered Institute of Ecology and Environmental Management (CIEEM), a leading institute for professional ecologists; its competency framework, which members are required to fulfil, makes no mention of statistics or data science.
8. Excel will convert a data entry into a date even if it is not one, e.g. an entry of “1/1” or “1-1” would return “01-Jan”!
9. More or Less (Spreadsheet disasters) was released by the World Service on 11 Feb 2023 and is available for over a year.
10. The clipboard icon is in the top right-hand corner of the code window.
11. Many R scripts are required in applying literate programming to bat data science; these are best organised through Quarto or R Markdown documents, where the R scripts form code chunks.
16. A commodity or service that is provided without profit to all members of a society, either by the government or by a private individual or organization.