Hello,
I've downloaded the complete particle snapshot of the TNG300-1 run, and I need to analyze it.
Currently, I'm reading the files sequentially using the h5py Python library.
However, a significant amount of time (~60 hrs) is spent reading those 600 .hdf5 files. Is there a way to get a single .hdf5 file, or do you have any Python or C++ scripts that merge those small files into a bigger one?
Thank You
Cheers,
Bipradeep
Dylan Nelson
19 Jun '22
Hi Bipradeep,
I don't think the number of files is the issue. The performance is going to be limited by the (read speed of the) filesystem you have stored the data on. If this is 100 MB/s, and you are trying to read 1 TB, then it will simply take time. You should (i) make sure you are only loading exactly the data you need, and (ii) make sure you are using the full performance of your filesystem.
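As a minimal illustration of point (i), reading a single field from one chunk with h5py might look roughly like this (the file name and dataset path below are placeholders, not your exact layout):

```python
import h5py

# Open one snapshot chunk and read only the single dataset that is needed.
# File name and dataset path are placeholders; adjust them to your data.
with h5py.File("snap_099.0.hdf5", "r") as f:
    print(list(f["PartType1"].keys()))      # check which fields exist before loading anything
    coords = f["PartType1/Coordinates"][:]  # only this dataset is pulled from disk
```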
Bipradeep Saha
28 Jun '22
Hi Dylan,
You were right, the number of files was not the problem here. I was reading the data sequentially and concatenating it onto a numpy array, and the concatenation was the most time-consuming part. Once I pre-allocated the arrays, things were much faster.
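For reference, the pre-allocation pattern looks roughly like this; the chunk file names and dataset path are illustrative and should be adapted to the actual snapshot directory:

```python
import h5py
import numpy as np

# Illustrative chunk names and dataset path; adjust to your snapshot layout.
files = [f"snap_099.{i}.hdf5" for i in range(600)]
field = "PartType1/Coordinates"

# First pass: get dtype/shape info so the output array can be allocated once.
counts = []
with h5py.File(files[0], "r") as f0:
    dtype = f0[field].dtype
    ncols = f0[field].shape[1]
for fn in files:
    with h5py.File(fn, "r") as f:
        counts.append(f[field].shape[0])

out = np.empty((sum(counts), ncols), dtype=dtype)  # one allocation, no repeated concatenation

# Second pass: write each chunk directly into its slice of the pre-allocated array.
offset = 0
for fn, n in zip(files, counts):
    with h5py.File(fn, "r") as f:
        f[field].read_direct(out, dest_sel=np.s_[offset:offset + n])
    offset += n
```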
Dylan Nelson
28 Jun '22
Hi Bipradeep,
Great - yes, it's always better to pre-allocate everything. You can also look at il.snapshot.loadSubset() for other relevant details.
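A typical call with the illustris_python scripts from GitHub looks something like the sketch below (the basePath is just an example):

```python
import illustris_python as il

# basePath should point at the simulation output directory containing the snapdir_* folders
# (the path below is only an example).
basePath = "./TNG300-1/output"

# loadSubset iterates over the snapshot chunks and pre-allocates the result internally,
# reading only the requested particle type and fields.
dm_pos = il.snapshot.loadSubset(basePath, 99, "dm", fields=["Coordinates"])
```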
Bipradeep Saha
Thanks for the GitHub link, I'll check it out.