CompBioMed Data Curation Helpdesk

Data publication is crucial to ensure that research data is accessible and reusable. It also enhances the visibility and impact of the research, promoting transparency and reproducibility. In addition, publishing data is often a compliance requirement of funding agencies, as it demonstrates a commitment to open science and responsible data management practices.

Within CompBioMed, we use the B2SHARE data repository, developed by EUDAT (funded by the EU), to publish datasets. We have defined CompBioMed community-specific metadata and implemented it in the B2SHARE repository. For more information about the metadata schema, you can visit: https://b2share.eudat.eu/communities/CompBioMed

For support with publishing data in the B2SHARE repository, you can contact EUDAT support at: https://eudat.eu/contact-support-request

If you need general advice on selecting a repository for your data, choosing data formats for publication, enriching data with metadata, or migrating data, you can contact us through this link.

Data curation often involves moving data, so here is a short overview of tools and techniques for moving data efficiently.

  • “Small Data” (less than 1 Gigabyte):
    • Such data can be transported easily using scp or sftp, which are mature and secure protocols well suited to smaller files, offering user-friendliness and reliability. rsync is an efficient tool that prioritises speed by transferring only the parts of files that have changed, making it ideal for incremental data updates.
  • “Big Data” (greater than 1 Exabyte):
    • High-Performance Computing (HPC) networks, such as GEANT2, are designed for massive data movement and provide the bandwidth and throughput needed for exabyte-scale datasets. Further, object storage is a cloud-based storage solution that excels at handling large, unstructured datasets, making it suitable for massive data archives. Lastly, data ingestion tools, such as Apache Kafka or Apache Flume, are streaming platforms that facilitate real-time data movement for large data pipelines.