Hot data and cold data
Cold data is data that is accessed only rarely. It is a bit like a box of documents in the attic: you might need it one day, but you almost never open it, so somewhat longer retrieval times are acceptable.
Hot data, on the other hand, is active data that is accessed frequently, even intensively. So-called ‘scratch’ storage spaces, for example, hold hot, non-backed-up data kept ready for use by computing or processing resources.
This notion of data temperature is distinct from the notion of storage duration.
Storage, backup and archiving
It is important to distinguish between these terms, as they correspond to different approaches. Storage, generally speaking, refers to the means, medium or place where data is kept. Backup refers to the replication of data on at least two separate media so that damaged or lost data can be restored. Backup can also be seen as the medium-term retention of data from completed projects (i.e. cold data) that you want to keep for possible re-use at a later date.
Archiving is the long-term storage of digital data in a secure and permanent manner, based on processes for checking the integrity of the data (and associated metadata), re-reading the content and associated metadata, and maintaining the data on a usable medium.
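To make the notion of integrity checking concrete, here is a minimal sketch (in Python, with hypothetical paths) of the kind of fixity check an archiving process relies on: a checksum is recorded for each file when the data enters the archive and is re-verified later to detect loss or silent corruption.

```python
import hashlib
import json
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Compute the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()

def record_manifest(data_dir: Path, manifest: Path) -> None:
    """At archiving time, store a checksum for every file in the data set."""
    checksums = {str(p): sha256_of(p) for p in data_dir.rglob("*") if p.is_file()}
    manifest.write_text(json.dumps(checksums, indent=2))

def verify_manifest(manifest: Path) -> list[str]:
    """Later, re-read the archive and report any file whose checksum changed."""
    checksums = json.loads(manifest.read_text())
    return [f for f, expected in checksums.items()
            if not Path(f).is_file() or sha256_of(Path(f)) != expected]

# Hypothetical usage:
# record_manifest(Path("dataset/"), Path("dataset.manifest.json"))
# damaged = verify_manifest(Path("dataset.manifest.json"))
```

A real archiving service also checks the associated metadata and migrates the data to new media over time; the checksum manifest illustrates only the integrity-checking part of that process.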
Some data, such as health data, are subject to storage constraints: they must be stored either in the place where they were produced (university hospitals host their own data, for example), or on an infrastructure with the ‘health data hosting’ label.
Data repositories
A data repository is a data store into which data sets can be deposited so that they are visible to and reusable by third parties. Depositing is distinct from archiving, which places the data in a preservation process. Using a data repository does not remove the question of how long the data should be kept.
There are many data repositories, both disciplinary and generic (such as Zenodo, for example).
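As an illustration of what ‘depositing’ means in practice, the sketch below creates a draft record on Zenodo and uploads a file to it via Zenodo’s public REST API. The endpoint names follow Zenodo’s documentation, but the token and file name are placeholders, and the details should be checked against the current API documentation.

```python
import requests

ZENODO_API = "https://zenodo.org/api/deposit/depositions"
TOKEN = "your-personal-access-token"  # placeholder: created in your Zenodo account

# Create an empty deposition (a draft record).
resp = requests.post(ZENODO_API, params={"access_token": TOKEN}, json={})
resp.raise_for_status()
deposition = resp.json()

# Upload a data file into the deposition's file bucket.
bucket_url = deposition["links"]["bucket"]
with open("results.csv", "rb") as fh:  # hypothetical data file
    requests.put(f"{bucket_url}/results.csv",
                 data=fh,
                 params={"access_token": TOKEN}).raise_for_status()

# Metadata (title, authors, licence, ...) must still be added and the record
# published before the data set becomes visible and reusable by third parties.
```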
Data processing
The question of storage is very often linked to what you want to do with the data and how you intend to use it. There are many possible approaches depending on the type of data, its size, and so on. In particular, whether data is structured or unstructured, i.e. organised (in a database, a table or a file formatted according to a precise nomenclature) or unorganised and complex (texts, data extracted from the Web, etc.), has an impact on the tools and infrastructures needed to exploit it.
Some examples of processing for structured data (a short illustrative sketch follows the list):
- Massive processing or analysis using infrastructures such as computing grids or supercomputers
- Use of machine learning algorithms via GPU-based infrastructures
- Visualisation via dedicated tools or infrastructures, or simple consultation
- Use of specific tools via cloud infrastructures
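To make the structured/unstructured distinction concrete, here is a minimal sketch (with invented sensor data) showing why structured data lends itself to direct processing: because the schema is known in advance, the analysis can be expressed as a simple query, whatever the scale of the infrastructure running it.

```python
import sqlite3

# Structured data: a table with a known schema can be queried directly.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE measurements (sensor TEXT, value REAL)")
conn.executemany("INSERT INTO measurements VALUES (?, ?)",
                 [("A", 1.2), ("A", 1.4), ("B", 0.7)])

# Because the structure is known, the analysis is a declarative query.
for sensor, mean in conn.execute(
        "SELECT sensor, AVG(value) FROM measurements GROUP BY sensor"):
    print(sensor, round(mean, 2))
```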
Some examples of processing for unstructured data (a short illustrative sketch follows the list):
- Depending on the size of the data, traversal, mining and visualisation tools that can be used locally or on cloud infrastructures, or simple consultation
- Use of machine learning algorithms via GPU-based infrastructures
- Use of “big data” infrastructures based on tools such as the Elasticsearch suite or Hadoop.
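Unstructured data, by contrast, has to be interpreted before anything can be computed from it. The following sketch (over a hypothetical directory of text files) shows the simplest kind of local mining pass mentioned above, counting term frequencies across a corpus; tools such as Elasticsearch or Hadoop distribute this same kind of work over many machines when the volume requires it.

```python
import re
from collections import Counter
from pathlib import Path

def term_frequencies(corpus_dir: Path) -> Counter:
    """Tokenise every text file in a directory and count word occurrences."""
    counts = Counter()
    for path in corpus_dir.rglob("*.txt"):
        text = path.read_text(encoding="utf-8", errors="ignore").lower()
        counts.update(re.findall(r"[a-zàâçéèêëîïôûù']+", text))
    return counts

# Hypothetical usage:
# print(term_frequencies(Path("corpus/")).most_common(10))
```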
All these uses can have an impact on where the data is stored.
Data volume and organisation
The volume of data to be stored varies enormously from one project to another. At present, volumes can be considered large from a few tens of TB upwards. Data at this scale makes transfers over the network particularly complex.
For example, for a transfer between a storage platform located in a UGA data centre and a laptop in a laboratory (on the UGA network), with a network throughput of 10 GB/s, transferring 10 TB will take 1000 seconds, or almost 17 minutes (at best, depending on the network load). This is a favourable case, because no firewall is crossed and the number of routers traversed is as low as possible; the transfer time to another site will be much longer. Transfer time is also strongly affected by the number of files: transferring a 1 TB volume made up of 10,000 files will take much longer than transferring a single 1 TB file.
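The order of magnitude in this example can be checked with a back-of-the-envelope calculation; the sketch below reproduces it and makes it easy to try other volumes or throughputs. It ignores per-file overhead, protocol overhead and network load, so it gives a best-case estimate only.

```python
def transfer_time_seconds(volume_tb: float, throughput_gb_per_s: float) -> float:
    """Best-case transfer time: volume divided by raw throughput (1 TB = 1000 GB)."""
    return (volume_tb * 1000) / throughput_gb_per_s

seconds = transfer_time_seconds(volume_tb=10, throughput_gb_per_s=10)
print(seconds, "s =", round(seconds / 60, 1), "min")  # 1000.0 s = 16.7 min
```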
Terms and conditions of access
Different ways of accessing data are possible depending on where it is stored. The choice of storage will therefore impose constraints on the way in which the data can be used. These access methods are based on very different technical protocols and storage technologies.
It is therefore important to identify the needs of each project before making any technical choices:
- How do we want to access the data?
- Who will access the data? With what rights (read, modify)?
- What should this access to data enable: processing, analysis, manipulation, distribution?
From the user’s point of view, the following types of access are possible (a brief sketch follows the list):
- Web access, where users access their data directly via their browser.
- Manual access via the network: users download their data onto their machine, either with a graphical client or from the command line. They can also set up a network drive.
- Automatic access via the network (the system takes care of the connections transparently), with the user viewing their data directly from their computer.
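As a small illustration of the first two access modes, the sketch below retrieves a file over HTTP(S) onto the local machine, which is essentially what a browser download, a graphical client or a command-line client does; the URL and file name are placeholders.

```python
import shutil
import urllib.request

# Web or scripted network access: fetch a remote file onto the local machine.
url = "https://example.org/datasets/results.csv"  # placeholder URL
with urllib.request.urlopen(url) as response, open("results.csv", "wb") as out:
    shutil.copyfileobj(response, out)
```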