Title: “Perhaps we have framed the research data management challenge incorrectly?”
UCB iSchool Friday Afternoon Seminar on Information Access
Speaker: Clifford Lynch
3:25 PM - 5:00 PM
South Hall 107
Approximately 24 people in the room, mostly librarians from California Digital Library.
Descriptive Notes:
- Since the 1980s, I have been concerned with research data management.
- Libraries have to spend time stepping up to research data management
- 20 years later, this has come true.
- Research funders began to get interested in research data as an asset that needs to be taken care of; preserved; made available for sharing and reuse.
- Setting up these environments could promote the progress of science: enable reproducibility and easier confirmation of results; allow reanalysis of data that comes out of experimental work; and support “meta-analysis”, which asks whether you can combine relatively commensurate data to better understand a phenomenon.
- Funders now require sharing of data, as long as it is balanced against human subjects and privacy concerns.
- “Nasty set of tensions running now with people ever more concerned with privacy and consent and narrow scoping of reuse and people who are very driven by idea of open science. In any event, funders have become quite serious about requiring researchers to document how they intend to document and what they intend to make available.”
- This march to Open Science is not capturing the whole of what is going on. Early warning signals: talking to colleagues in the late 2000s about the Sloan Digital Sky Survey - whole sweeps of the sky at various wavelengths. They made a huge dataset out of it. Astronomy used it as a primary reference point. It was repurposed at Microsoft Research as a virtual observatory that high school students could navigate around.
- Property of this - wasn’t really a dataset, was an active data system. When you look at discussions of preserving it - “can we just take data to park it on the shelf indefinitely and pay for it indefinitely” didn’t capture correctly the role of that data and the way it was used.
- Biomedical information: cost of preserving and providing access to biomedical data. Long history of these community information systems. Some of them are big; others are smaller and more niche. Precarious funding for the smaller ones. “How do we finance it going forward?” What a grant would have to fund here is usually not research; it’s operations.
- GenBank - funded by the National Institutes of Health. Some kind of gov’t funding earmarked.
- NSF funded research. Grant runs out and NSF says - we are in the business of funding research; once by-products get out, we just wash our hands of it; somehow the community is supposed to figure out what to do with it.
- Examples from the life sciences - plant genome work that got defunded and is now a private consortium; Protein Data Bank - surviving somehow as an ongoing research project because so many people rely on it. It has managed to get renewed NSF funding.
- Characteristic of these systems - aggregated in nature. Labs spread around the country. Those labs collecting data; Then feed it into these aggregating platforms.
- Data sharing is very much driven by community platforms. Yet these platforms are basically missing from the whole way we have been thinking about research data and lifetime guarantees for data, DMPs, etc.: “I’ll make this data available for sharing and keep it for 20 years.”
- The way the world of DMPs and the world of platforms interact is very strange. If a researcher is writing a grant whose data has a natural home in one of these platforms, they will punt it off - “of course we will be sequencing genes and put it in GenBank”. Folks reviewing it will say okay, sounds good. But the whole question of the long-term availability of things is punted over to the question of what the survival prognosis of GenBank is.
- NSF is funding something in 2021, the Synoptic Sky Survey: a much bigger array and data-intensive sky imaging. It produces a terrifying number of terabytes of data per day, which will be a community data system. A big piece of the budget is data management. But when the project is over, what happens to it?
- Argument: If you look at actual scholarly practice, there is a big disconnect with the way research data management plans and the conversation about project-oriented curation are framed: “we collect up data, curate it, and then when all done, find an appropriate repo and park it with metadata and then we are done, and the only question going forward is how do you (pre)finance a repository which has an ongoing expense?”
- That kind of model is deeply embedded in current discourse regarding DMPs but disconnected from what is actually happening in the field and in scholars’ practice.
- Plenty of examples of this, although on a less organized basis, in the humanities as well. DH (digital humanities) is rife with this sort of thing.
- Issue in the humanities is that they tend to be more idiosyncratic, so communities are smaller and the ways contributions are made are more complicated. From websites to scholarly info systems… we don’t have good terminology here. That is an ongoing kind of weak spot. If you look at some of the communities studying circles of writers, or groups trying to understand migrations, the slave trade, the spread of ideas and practices across regions, some of these straddle the humanities, social sciences and hard sciences in really interesting ways. See these communities pulling together stuff from multiple areas (e.g. when and where were cattle domesticated).
- Dominant model - we have FTP. If you want to use that data; end up pulling it back locally. The problem is that now sometimes data is very big and sometimes your compute resources aren’t local.
- Where does data live and where does computation live? How do you get the computation and the data in the same place? How do you think about the costs of moving large sets of data around?
- E.g. discussion about “if we put data in the cloud, then access and egress charges mean it could be very expensive.” But if you want to make data available, there are pilots where you take the data, put it on Amazon or Azure as public data, then say to researchers: buy your own cycles from AWS or Azure, etc. They don’t have to deal with the egress charges, and it gives data providers a tidy way to make the data available without having to incur the computational cost. Flip side: I am sort of selecting or privileging a specific computational cloud.
- Those who want to do high performance computing - might want to store data close to them. Other set of data conceptual models - might have big data but the way you are going to reuse it is just very trivial.
- Digital Libraries: The world of knowbots. Vint Cerf and Bob Kahn. https://www.cnri.reston.va.us/kahn-cerf-88.pdf
- Scholarly support systems need to accommodate for different
- Privacy and data liability. More and more cases where folks are concerned about deidentifying data. They are increasingly not confident that they can make a deidentified dataset available, because even if they have deidentified it, people can re-identify it. So they get around that by saying: I am not going to make any data available; I will let you come to me, perform a constrained set of manipulations on the data, and take the results back. Fascinating conversation - if you follow that model, how do you persuade your security people that you are not opening a massive attack surface for your data? How do you have confidence in the sandboxing and control? If you look at existing practice, much of it is quite draconian. In social science areas like census and criminal justice data, there are notions of secure data enclaves that are run by government agencies or their contractors. The idea is that a researcher could go to the secure enclave and run computations there. The code they run is inspected, the data they want to take out is inspected, and that kind of really labor-intensive process is how people are dealing with the sandbox problem. Not good scaling properties. That is how we are seeing the reality of cloud and increasing concerns about cloud and liability. There is a sense that people want to say - no good language for this - “I’m willing to share my data but only under certain restrictions.” That is what making available anonymized versions of your data is doing.
- FAIR principles - Findable, Accessible, Interoperable, Reusable. “Who could be opposed to FAIR data?” Gaining lots of traction over the last 2 years in Open Science.
- Trouble comes when you try to figure out what the heck any of those terms means… Good papers and analyses are trying to pick these things apart.
- Findable is tricky.
- It means - if I have a known item, like the dataset underlying a published research report, let’s not put a URL on the paper and hope it stays forever. There are mechanisms like DOIs and DataCite. That is one piece of “findable”.
- Unknown item searching. I am looking for datasets on salinity in the Pacific Ocean over the last 30 years. Trouble with that is that usually you have a fairly precise thing you want, and the mismatch is in descriptive practices.
- Literature from the 1960s onward - bibliographic classification systems and how they work in online catalogues, etc. There is a ton of literature there and none of that work has even been attempted for research data. Frightening thing is we are watching people wandering around saying - “the thing to do is build meta-catalogues (in the same sense as DataONE). Let’s make it even bigger! Throw in humanities data, eco data, etc. It will just be wonderful.” First intuition - what could possibly go wrong there.
- Comment from audience: “Decontextualized bodies of data. The social scientists would just be saying ‘eek!!!’ Context is so important to them and this is an example of taking it a step further. Clearly needs more sophisticated approach.”
- In order to aggregate the data, need to agree on standards. Community databases come with an impulse to standardization of practice. The datasets that sit on the shelf are the kinds that don’t fit into this model.
- Kris - environmental ecologists. 40 ways to take the data, and whether you will reuse it is absolutely dependent on knowing how it was measured and collected. But if you go to the fairly superficial metadata, it probably doesn’t tell you enough to do that. If you are lucky, it has a codebook. But the scaling properties there are very bad.
- Audience: That’s true of research design in general!
- Clinical trials readouts. In-depth looks at how the data was measured in clinical trials. So these people normalize the outcomes of clinical trials so people can compare them or do meta-analysis on them in meaningful ways. There is money in cleaning this up, but only in niche cases. Very rare.
- ACCESSIBLE: how do I find it? DOI - in theory you can resolve this. If you are a machine, DOI might put you on a landing page or it might take you to dataset itself. There are a couple of things in play:
- Big difference in accessible by humans vs machines.
- It’s not necessarily helpful to leave stuff lying around with no indication of who owns it or can do what with it. Good practice to make that clear up front.
- Side discussion - is data copyrightable, etc. But for purposes of this discussion, it should be clear up-front what the rules of the road are.
- INTEROPERABLE: everything from “can I tell how to download it” to “what is the structure of the data”.
- REUSABLE: Am I confident I can interpret what the structure of the data is?
- We have limited understanding of how to operationalize these principles.
- Barend Mons: https://www.mitpressjournals.org/doi/pdfplus/10.1162/dint_a_00002
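A minimal sketch of the human-versus-machine distinction noted under ACCESSIBLE above, assuming the DOI resolver honors HTTP content negotiation. The DOI used here is hypothetical, the choice of media type is an assumption, and the request is only constructed, never sent:

```python
import urllib.request

# Public DOI resolver; the DOI passed in below is a made-up placeholder.
DOI_RESOLVER = "https://doi.org/"

# Assumed media types: HTML for a person in a browser, CSL JSON metadata
# for a program. Which types a given DOI actually supports depends on its
# registration agency.
HUMAN_ACCEPT = "text/html"
MACHINE_ACCEPT = "application/vnd.citationstyles.csl+json"


def build_doi_request(doi: str, machine: bool) -> urllib.request.Request:
    """Build (but do not send) a request that resolves a DOI.

    The same identifier is requested in both cases; only the Accept
    header changes, which is what lets a resolver answer with a landing
    page for a person or structured metadata for a machine.
    """
    accept = MACHINE_ACCEPT if machine else HUMAN_ACCEPT
    return urllib.request.Request(DOI_RESOLVER + doi, headers={"Accept": accept})


if __name__ == "__main__":
    for machine in (False, True):
        req = build_doi_request("10.1234/example-dataset", machine)
        print(req.full_url, "->", req.get_header("Accept"))
```

Whether the response then leads to a landing page, a metadata record, or the dataset itself is still up to the repository behind the DOI, which is the ambiguity the notes point at.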
Implications:
- Ongoing curation of the semantics of data (which don’t stay stable).
- In practice - once data goes to the repo, there is no money for updating it; it is documented once and done.
- We know that things don’t stay stable. Names of proteins, genes, etc. change.
- A scientific community struggles to understand something, and then they come together as a community to name it. E.g. AIDS. For that particular body of literature, people have gone back and cleaned it up. But are we going to do that for all of scholarship? What are the loci of activity there? Who is going to do it? It doesn’t align with organizations, except in the weird case of biomedical things, because of the standing investment there.
- University libraries - what is their role?
- Amazingly compromised. Supposed to be serving assortment of disciplines and if you look at the resourcing levels, no way any individual library can do this support for all of the disciplines represented.
- What are the politics of this? Politics of disciplines.
- Data Curation Network (https://datacurationnetwork.org/) - at a superficial level - sharing expertise to do the kinds of things we do today. They are not saying we need a sustained 20 million dollar a year investment forever. Can make an argument for consortia. Institutions with matching funding from science funders? But the politics get vicious. Doesn’t lend itself to doing on a distributed basis.
- Farmington Plan (https://www.britannica.com/topic/Farmington-Plan / https://en.wikipedia.org/wiki/Farmington_Plan): research libraries divided up collecting. Universal coverage. Then we do interlibrary loan. Worked when libraries had more money than they knew what to do with. Died in the 1970s. No university was willing. Budget was for local.
- There are some fields that have invested heavily in the resource and have been able to accelerate the pace of scientific advancement (synergy between biotech and pharmaceutical industries; prospects for precision medicine). Never seen a good effort on a public policy basis to analyze - “over the last few decades we have invested XXX USD in XXX; how does the quality of science there compare with areas where we have made minuscule investments?” Nonetheless, I think many of us have an intuitive belief that sustained investment in systematic knowledge management has a payoff and not doing that has a detriment. But substantiating it is hard.
- Data reuse regimes:
- 1) world of these repositories where data is described and sits on a shelf. Humans in the loop fundamentally, interpreting what to use.
- 2) if we could describe things better, you might be able to do more automated reuse
- 3) active, human-driven semantic aggregation. Reuse is implied in the operationalization of the system.
- 4) Investments to make the knowledge abstracted. Investments in gene expression ontologies, disease models. Issue is you are paying people who understand knowledge representation and content to spend time to codify the data. We don’t know funding mechanisms for that or payoff. Getting to that place is the vision that a number of people talking about future of scholarship are trying to head but economic and organizational factors seem totally disconnected from the vision. Implications for both disciplines and institutions are fascinating.
Q&A:
- Audience: We need more self-conscious labeling of the information professions. Huge body of knowledge that is around the management of information. The framework of the people in the field vs information professions is fundamentally different. The information science that exploded in the 60s that was such an exciting field never received its full maturation.
- Clifford: fault line down the side of that. Dream that info science is a science with its universal principles but lots of evidence that the most effective knowledge structuring is coming very close to the disciplines. Where we are today and future - we don’t understand how to go from discipline independent field to what the disciplines need and how we understand that split line is a huge and untouched area. That is exactly what people in the iSchool world should be focusing on.
- Communities of data. Big gap in the way data is preserved and approached.
Critical Commentary
AO: These are my notes from a public event at the University of California, Berkeley's Information School on January 24, 2020, where Clifford Lynch spoke on research data management during the regular seminar on Information Access.