How to Avoid Disaster: The Importance of Responsible Data Management and Education
From photographers losing thousands of original photographs to the programmers of Toy Story 2 nearly deleting the entire film with a single errant command, the impact of poor data management and education can be both far-reaching and irreversible. Although the digitization of analog research data has made collecting, processing, and analyzing information easier than ever, errors in these processes can create serious problems for analysts, ranging from data loss to the retraction of publications. Moreover, most researchers are never actually taught the basic components of responsible data management and research computing, even though the use of big data research methods in both academia and industry has exploded over the past five years.
Dr. René Malenfant, an instructor in the Department of Biology at the University of New Brunswick, completed an undergraduate degree in computing science at St. Francis Xavier University before moving on to study biological sciences at the University of Alberta. One of the classes he teaches at UNB, Practical Computing for Biologists, focuses on closing the gap between education and application in responsible data management and analysis. “Science is becoming increasingly computationally intensive, and datasets are getting larger and larger. Students and researchers need to know how to store and manipulate these datasets, which cannot be handled in the same way as small text or Excel files”, says Dr. Malenfant of the importance of good data management practices. Moreover, “some researchers claim that there is a ‘replication crisis’ in many fields of science”, meaning that when people try to repeat a study to verify its results, they fail.
Data management is an umbrella term that covers everything from basic operations such as storing, moving, and saving information to filtering, editing, and manipulating it with a variety of complex formulas. A sequence of operations, or pipeline, can also be used to analyze a dataset: the output of one function becomes the input of the next action or string of operations. Whether working with pipelines or with single, basic commands, it is imperative to the integrity and adequate analysis of scientific data that researchers are sufficiently practiced in spotting errors, making informed statistical decisions, and ensuring that their workflows are reproducible.
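As a rough illustration of the pipeline idea, the sketch below chains three small steps so that each one's output feeds the next. It uses Python with the pandas library purely as an example; the file name and column names ("measurements.csv", "body_mass_g", "species") are hypothetical placeholders rather than anything from a real study.

```python
import pandas as pd

# A minimal pipeline sketch: load -> clean -> summarize.
# File and column names are hypothetical placeholders.

def load(path):
    """Step 1: read the raw dataset from disk."""
    return pd.read_csv(path)

def clean(df):
    """Step 2: drop rows with missing measurements."""
    return df.dropna(subset=["body_mass_g"])

def summarize(df):
    """Step 3: compute a per-species mean of the cleaned data."""
    return df.groupby("species")["body_mass_g"].mean()

# Chain the steps: the output of each function is the input of the next.
summary = summarize(clean(load("measurements.csv")))
print(summary)
```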
So how can researchers safeguard against errors in data processing, mass deletion of digitized information, and irreplicable analyses? According to Dr. Malenfant, making scientific data publicly available is one way of helping others confirm the results of a study or find errors in the initial analysis. Additionally, most big data analyses are performed using programs that process commands in the form of text, known as command-line interfaces, such as R or AutoCAD. Programs such as Microsoft Excel, on the other hand, operate through graphical user interfaces that prompt users to click on buttons to perform data manipulations like sorting or filtering. Publishing the code used in a command-line interface to analyze the data, in addition to the dataset itself, is a further step that researchers can take to detect errors and increase the replicability of their work, says Dr. Malenfant. This is especially important in academia, as most hands-on research is done by graduate students and postdoctoral researchers who are temporary members of a lab and who often generate datasets and leave them behind for future lab members to analyze. “Ideally, any member of a lab should be able to open a dataset, understand what they are looking at, and start using it”.
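As a hypothetical illustration of why scripted analyses are easier to share and verify than point-and-click ones, the short Python sketch below performs a filter-and-sort that could just as easily be done in a spreadsheet; written as code, however, it can be published alongside the dataset and rerun by anyone. The file and column names are invented for the example.

```python
import pandas as pd

# Hypothetical example of a scripted, shareable processing step.
# Doing the same filter-and-sort by clicking in a spreadsheet leaves
# no record; publishing this script with the data lets others rerun it.

df = pd.read_csv("field_survey.csv")  # invented file name

# Keep only adult individuals and sort by sampling date.
adults = df[df["age_class"] == "adult"].sort_values("date")

# Save the processed table that downstream analyses will use.
adults.to_csv("field_survey_adults.csv", index=False)
```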
Citing COVID-19 restrictions as a particular source of difficulty that has prevented many students from working in the field or in a laboratory, Dr. Malenfant recognizes that many have been forced to rely on data previously collected by other researchers in their lab groups or gathered from the web. In this context, data security can become a significant concern when using publicly available data online or contributing to shared databases. According to Dr. Malenfant, some data are sensitive and need to be protected, such as patient data or information on endangered species that may reveal their locations to poachers. Such information needs to be “appropriately anonymized or censored if it is to be released” in order to be made safely accessible to the public.
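One common approach to releasing such data safely, assumed here purely for illustration rather than taken from Dr. Malenfant, is to strip direct identifiers and coarsen precise locations before publishing. The Python sketch below shows that idea with invented file and column names.

```python
import pandas as pd

# Hypothetical anonymization sketch: drop identifying columns and
# coarsen coordinates so exact locations (e.g., of endangered species)
# cannot be recovered from the public file. All names are invented.

df = pd.read_csv("nest_sites.csv")

# Remove columns that directly identify people.
df = df.drop(columns=["observer_name", "landowner_contact"])

# Round coordinates to about 0.1 degree (roughly 10 km) so individual
# sites cannot be pinpointed from the published data.
df["latitude"] = df["latitude"].round(1)
df["longitude"] = df["longitude"].round(1)

df.to_csv("nest_sites_public.csv", index=False)
```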
Finally, there is increasing pressure on the granting agencies that distribute funding for scientific research, such as the Natural Sciences and Engineering Research Council of Canada (NSERC) and the Canadian Institutes of Health Research (CIHR), to require that researchers submit data management plans with their grant applications. Researchers are also expected to make the data they have collected available after the publication of their projects. “Even if researchers are skeptical about the value of data management and deposition, these requirements will soon be forced upon them”, says Dr. Malenfant of the changes to standard data management practice that the scientific community is currently experiencing. If the threat of mass data deletion, incorrect analyses, and coding errors is not enough to prompt researchers to invest in data management education and practices, these evolving conventions may strong-arm the vast majority in the coming years.
Programs such as Apple’s Time Machine allow users to duplicate the contents of their computers and store them on an external hard drive that fits in the palm of a hand, while other tools, such as the shell, hold the power to obliterate every document, photo, and setting with a single mistyped character. Canadian computer scientist Alfred Aho describes the management and analysis of data as a “science of abstraction—creating the right model for a problem and devising the appropriate mechanizable techniques to solve it”. Through the formidable technological advancements that the scientific community continues to produce, these techniques are being revolutionized and reimagined every day. From a computer the size of a classroom to one worn on the wrist, the whirlwind evolution of technology shows no signs of slowing, underscoring the vital role of smart and secure data management, manipulation, and storage.
References:
Eaker, C. 2016. “What Could Possibly Go Wrong? The Impact of Poor Data Management”. Tennessee Research and Creative Exchange.
Hart, E., Barmby, P., LeBauer, D., Michonneau, F., Mount, S., Mulrooney, P., Poisot, T., Woo, K., Zimmerman, N., Hollister, J. 2016. “Ten Simple Rules for Digital Data Storage”. PLOS Computational Biology. https://doi.org/10.1371/journal.pcbi.1005097.
Noble, W. 2009. “A Quick Guide to Organizing Computational Biology Projects”. PLOS Computational Biology. https://doi.org/10.1371/journal.pcbi.1000424.
Wilson, G., Bryan, J., Cranston, K., Kitzes, J., Nederbragt, L., Teal, T. 2017. “Good Enough Practices in Scientific Computing”. PLOS Computational Biology. https://doi.org/10.1371/journal.pcbi.1005510.
Zook, M., Barocas, S., Boyd, D., Crawford, K., Keller, E., Gangadharan, S., Goodman, A., Hollander, R., Koenig, B., Metcalf, J., Narayanan, A., Nelson, A., Pasquale, F. 2017. “Ten Simple Rules for Responsible Big Data Research”. PLOS Computational Biology. https://doi.org/10.1371/journal.pcbi.1005399.