Hello,
I've downloaded the complete particle snapshot of the TNG300-1 run, and I need to analyze it.
Currently, I'm reading the files sequentially using the h5py Python library.
However, a significant amount of time (~60 hrs) is spent reading those 600 .hdf5 files. Is there a way to get a single .hdf5 file, or do you have any Python or C++ scripts that merge those small files into a bigger one?
Thank You
Cheers,
Bipradeep
Dylan Nelson
19 Jun '22
Hi Bipradeep,
I don't think the number of files is the issue. The performance is going to be limited by the (read speed of the) filesystem you have stored the data on. If this is 100 MB/s, and you are trying to read 1 TB, then it will simply take time. You should (i) make sure you are only loading exactly the data you need, and (ii) make sure you are using the full performance of your filesystem.
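As a minimal illustration of point (i), reading a single field from one chunk with h5py might look roughly like this (the file name and dataset path below are placeholders, not your exact layout):

```python
import h5py

# Open one snapshot chunk and read only the single dataset that is needed.
# File name and dataset path are placeholders; adjust them to your data.
with h5py.File("snap_099.0.hdf5", "r") as f:
    print(list(f["PartType1"].keys()))      # check which fields exist before loading anything
    coords = f["PartType1/Coordinates"][:]  # only this dataset is pulled from disk
```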
Bipradeep Saha
28 Jun '22
Hi Dylan,
You were right, the number of files was not the problem here. I was reading the data sequentially and concatenating it onto a numpy array, and the concatenation was the most time-consuming part. Once I pre-allocated the arrays, things were much faster.
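For reference, the pre-allocation pattern looks roughly like this; the chunk file names and dataset path are illustrative and should be adapted to the actual snapshot directory:

```python
import h5py
import numpy as np

# Illustrative chunk names and dataset path; adjust to your snapshot layout.
files = [f"snap_099.{i}.hdf5" for i in range(600)]
field = "PartType1/Coordinates"

# First pass: get dtype/shape info so the output array can be allocated once.
counts = []
with h5py.File(files[0], "r") as f0:
    dtype = f0[field].dtype
    ncols = f0[field].shape[1]
for fn in files:
    with h5py.File(fn, "r") as f:
        counts.append(f[field].shape[0])

out = np.empty((sum(counts), ncols), dtype=dtype)  # one allocation, no repeated concatenation

# Second pass: write each chunk directly into its slice of the pre-allocated array.
offset = 0
for fn, n in zip(files, counts):
    with h5py.File(fn, "r") as f:
        f[field].read_direct(out, dest_sel=np.s_[offset:offset + n])
    offset += n
```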
Dylan Nelson
28 Jun '22
Hi Bipradeep,
Great - yes, it's always better to pre-allocate everything. You can also look at il.snapshot.loadSubset() for other relevant details.
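A typical call with the illustris_python scripts from GitHub looks something like the sketch below (the basePath is just an example):

```python
import illustris_python as il

# basePath should point at the simulation output directory containing the snapdir_* folders
# (the path below is only an example).
basePath = "./TNG300-1/output"

# loadSubset iterates over the snapshot chunks and pre-allocates the result internally,
# reading only the requested particle type and fields.
dm_pos = il.snapshot.loadSubset(basePath, 99, "dm", fields=["Coordinates"])
```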
Bipradeep Saha
Thanks for the GitHub link, I'll check it out.