
Our work as theoretical particle physicists requires computer simulations on large-scale supercomputer systems, with workflows that have to handle enormous datasets. Projects often involve international collaborations and several supercomputers scattered around the planet. This creates unique challenges, especially in terms of data storage.
Not only is the scientific data we generate extremely large, it's also extremely expensive to produce. Some datasets take years of computing to generate, which in turn significantly increases their value. Additionally, as scientists we have a duty to ensure that the results of our work remain reproducible long into the future, which requires preserving and curating the data behind our scientific publications. For these reasons, it is critical that we can store these high-value datasets on a long-term basis in a highly resilient manner.
Another aspect of the data challenge we face is making these large datasets accessible to other groups of scientists spread across multiple continents, groups that often work independently of each other. The fact that scientific collaborations are often naturally decentralized, combined with the other challenges described above, prompted me to explore how Storj Decentralized Cloud Storage (DCS) might be able to help us.
Decentralization in the Scientific World: DeSci
The truth is that decentralized technology isn't new to the world of science or particle physics. A great example is the work being done with CERN's Large Hadron Collider (LHC). Even though the LHC is located in Geneva, Switzerland, the scientists analyzing its data to produce the associated physics are scattered around the world. In other words, the European Organization for Nuclear Research (CERN) has been operating a large distributed storage and computing system for years: the Worldwide LHC Computing Grid (WLCG).
My goal with Storj DCS was to assess its performance and resilience for our large datasets. This meant not only a large number of files, but also very large individual files. We were particularly interested in how stable its performance remained when accessed from diverse geographic locations. We therefore conducted a series of tests based on synthetic data from January to February 2021, and published our results in a detailed report on October 13, 2021. In short, our ultimate findings were favorable: the multilayered parallelism of Storj DCS optimized edge-based performance for data transfer and geographic accessibility. We are now continuing this collaboration with Storj and aim to reassess this outcome through real scientific use cases.
Study Process and Results
The main objective of the study was to qualitatively determine the efficiency of Storj DCS for synthetic data by simulating typical large datasets used in high-performance computing. To test the performance of the Storj DCS network, our study employed three approaches (a sketch of the first follows the list):
- Transferring 4 GB files in parallel using the Rclone client
- Attempting to increase parallelism by manually splitting the files into smaller 64 MB chunks
- Transferring 128 GB files in parallel using Storj DCS's native parallelism
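As a rough illustration of the first approach, this kind of parallel Rclone transfer can be driven by a short script. The sketch below is minimal and makes assumptions beyond the study itself: an Rclone remote named `storj-dcs` already configured for Storj DCS, a hypothetical bucket `lattice-data`, and a local directory of test files. The `--transfers` flag is what controls how many files Rclone moves concurrently.

```python
import subprocess

# Assumed names (not from the study): an rclone remote "storj-dcs"
# configured for Storj DCS, a bucket "lattice-data", and a local
# directory holding the 4 GB test files.
SOURCE = "/data/test-files"
DEST = "storj-dcs:lattice-data"

# `--transfers` sets how many files rclone uploads concurrently,
# i.e. the file-level parallelism used in the first approach.
subprocess.run(
    ["rclone", "copy", "--transfers", "8", SOURCE, DEST],
    check=True,
)
```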
We used manual splitting into smaller chunks to establish a baseline for the highest performance we could expect. In practice, however, manual chunking would not be the most workable approach, and our tests did not account for the overhead and time of manually splitting and reconstituting files. That said, the Rclone utility and DCS's native parallelism still achieved impressive results in comparison.
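To make concrete what that unaccounted overhead involves, here is a minimal sketch of the split/reassemble bookkeeping, assuming the 64 MB chunk size from the study; the chunk-naming scheme is our own invention for illustration.

```python
from pathlib import Path

CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, the chunk size used in the study

def split_file(path: Path, out_dir: Path) -> list[Path]:
    """Split `path` into chunks named <name>.part0000, <name>.part0001, ..."""
    out_dir.mkdir(parents=True, exist_ok=True)
    chunks = []
    with path.open("rb") as src:
        index = 0
        while data := src.read(CHUNK_SIZE):
            chunk = out_dir / f"{path.name}.part{index:04d}"
            chunk.write_bytes(data)
            chunks.append(chunk)
            index += 1
    return chunks

def reassemble(chunks: list[Path], dest: Path) -> None:
    """Concatenate downloaded chunks (in order) back into one file."""
    with dest.open("wb") as out:
        for chunk in sorted(chunks):
            out.write(chunk.read_bytes())
```

Neither the time spent splitting before upload nor reassembling after download was counted against the chunked transfer rates quoted below.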
For example, Rclone transfers of 4 GB files uploaded at 1 GB/s, compared with 5.2 GB/s for a manually chunked file. Even more impressive, the upload of a 128 GB file with native DCS parallelism achieved 4 GB/s versus 4.8 GB/s with manual chunking. Downloads with native parallelism ran at about 2.7 GB/s compared with 5.7 GB/s for a manually chunked file, which, as mentioned above, does not account for the time to reconstruct the chunked file.
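For easier comparison, those figures side by side:

| Operation | Rclone / native parallelism | Manual 64 MB chunking |
| --- | --- | --- |
| Upload, 4 GB files (Rclone) | 1 GB/s | 5.2 GB/s |
| Upload, 128 GB file (native parallelism) | 4 GB/s | 4.8 GB/s |
| Download (native parallelism) | 2.7 GB/s | 5.7 GB/s |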
One important result from our tests is that download performance varies minimally with location on Storj DCS. We generally saw excellent download rates whether the transfers took place in Edinburgh or at various locations in the U.S.
Overall, we were impressed with Storj DCS's out-of-the-box performance in moving large datasets, which we feel is a direct benefit of the way its decentralized network is structured. The excellent download speeds we observed from locations far from the upload location would also be valuable to us.
Even though high performance and throughput aren't usually associated with decentralized storage solutions, we found that Storj DCS has the potential to deliver multi-GB/s speeds with increased parallelism, redundancy, and resiliency. We look forward to extending our collaboration with Storj, and to extending our study of how their decentralized technology can address the ever-growing challenges of data sharing in high-performance scientific computing.
For more details on the study, download the full report or view a webinar of Dr. Portelli discussing the report's findings.