I was interested to read Chris Sexton’s blog on the discussions at today’s RUGIT meeting. One particular topic that caught my eye was research data management. There are, as the RUGIT meeting picked out, a number of challenges. Firstly, there is the volume of data. This was something that Guy Coates from the Sanger Institute picked out at last month’s Eduserv Symposium. The Sanger Institute generate enormous volumes of data and accurately curating it for future use is a real difficulty. One of the lessons they have learned is that it isn’t necessary to keep all the data all of the time – there is a need to regularly carry out a data triage to assess what is and isn’t going to be useful going forward. One point to pick out is that the work of the Institute focuses on the human genome and that better sources emerge as technology moves on – this certainly helps make the decision on which data to delete. Coates wasn’t the only one at the Eduserv Symposium to suggest deleting data – Anthony Brookes from the University of Leicester highlighted the difference between data and knowledge. He suggested that it was better to focus on the knowledge acquired from the data rather than slavishly pursue the curation of the data itself. Both speakers identified potential ways of reducing the data storage burden, although it is worth pointing out that neither is entirely risk-free. Once data is deleted, it is gone and, as one speaker at the Symposium identified, no one will thank you for deleting the wrong set of data. However, given the rate at which the volume of research data is expanding, a pragmatic approach to data curation seems more appropriate than long-term storage.
Another challenge is cost and it is, as Chris points out, a delicate balancing act. A number of studies have shown that a frighteningly large amount of research data is held on what I would call non-secure media – data that is backed up to memory sticks, CDs, removable hard disks etc. Many IT service departments have spent a considerable amount of time wooing their researchers to store their data centrally to ensure that it is backed up regularly and so the risk of data loss is greatly reduced. Charging researchers for keeping their data over an extended period may well reverse the trend and, as Chris points out, result in more academics purchasing portable hard drives from their friendly High Street supplier without consideration of backup, archiving and future use. It is a real problem. I was on the Steering Board for one project considering a cloud-based solution for data storage and it was clear from the research administration input to that Board that, whilst the cost of data storage during a research project could be factored into a research grant bid, it was felt that the long-term storage could not. So it is not just a balancing act for the IT department but for researchers too. They need to identify, as part of the data management plans required by some of the Research Councils, how they can meet the need for long-term, accessible data storage.
The UK Research Data Storage project picked out that long-term data curation was not an IT problem – the greater challenges were to ensure that the data that was stored was tagged appropriately so it could be interpreted and reused, and to provide the incentive to researchers to tag their data at all. That may not be the whole truth now. Storing the volume of data is a challenge but so too is the need for a regular assessment of what is stored in order to determine what has been superseded and so can be deleted, assessment of what can be removed because the information derived from the data has greater value than the raw data itself, and accurate tagging to allow such assessments to be made. So whilst there may be benefits of collaboration between institutions to store the increasing volumes of research data, the bigger challenge in my view is changing the culture and behaviour of researchers.