Loading subsample of all particles of given type

Yesukhei Jagvaral
  • 24 Mar '21

Hello Dr. Nelson,

I am little bit confused about how to load subsample of DM particles. For example, i can load all the DM particles with >>> dm_pos = il.snapshot.loadSubset(basePath,99,'dm',['Coordinates']);. However for memory and speed issues i want load only maybe 10% of the particles. How can that be achieved? The keyword 'subset' from
def loadSubset(basePath, snapNum, partType, fields=None, subset=None, mdi=None, sq=True, float32=False): seems little bit confusing to me. I thought this 'subset' keyword took in list of index arrays of DM particles, but it seems to take in a dictionary of offests?

Thanks!

Dylan Nelson
  • 24 Mar '21

Hi Yesukhei,

The subset parameter of snapshot.loadSubset() can indeed be used for this. It should be simplified more, but for now I attach some example code of how to sub-divide a load into a number of smaller loads:

import h5py
import numpy as np
import illustris_python as il

def pSplitRange(indrange, numProcs, curProc, inclusive=False):
    """ Divide work for embarassingly parallel problems. 
    Accept a 2-tuple of [start,end] indices and return a new range subset.
    If inclusive==True, then assume the range subset will be used e.g. as input to snapshotSubseet(),
    which unlike numpy convention is inclusive in the indices."""
    assert len(indrange) == 2 and indrange[1] > indrange[0]

    if numProcs == 1:
        if curProc != 0:
            raise Exception("Only a single processor but requested curProc>0.")
        return indrange

    # split array into numProcs segments, and return the curProc'th segment
    splitSize = int(np.floor( (indrange[1]-indrange[0]) / numProcs ))
    start = indrange[0] + curProc*splitSize
    end   = indrange[0] + (curProc+1)*splitSize

    # for last split, make sure it takes any leftovers
    if curProc == numProcs-1:
        end = indrange[1]

    if inclusive and curProc < numProcs-1:
        # not for last split/final index, because this should be e.g. NumPart[0]-1 already
        end -= 1

    return [start,end]


def loadSubset(simPath, snap, partType, fields, chunkNum=0, totNumChunks=1):
    """ Load part of a snapshot. """
    nTypes = 6
    ptNum = il.util.partTypeNum(partType)

    with h5py.File(il.snapshot.snapPath(simPath,snap),'r') as f:
        numPartTot = il.snapshot.getNumPart( dict(f['Header'].attrs.items()) )[ptNum]

    # define index range
    indRange_fullSnap = [0,numPartTot-1]
    indRange = pSplitRange(indRange_fullSnap, totNumChunks, chunkNum, inclusive=True)

    # load a contiguous chunk by making a subset specification in analogy to the group ordered loads
    subset = { 'offsetType'  : np.zeros(nTypes, dtype='int64'),
               'lenType'     : np.zeros(nTypes, dtype='int64') }

    subset['offsetType'][ptNum] = indRange[0]
    subset['lenType'][ptNum]    = indRange[1]-indRange[0]+1

    # add snap offsets (as required)
    with h5py.File(il.snapshot.offsetPath(simPath,snap),'r') as f:
        subset['snapOffsets'] = np.transpose(f['FileOffsets/SnapByType'][()])

    # load from disk
    r = il.snapshot.loadSubset(simPath, snap, partType, fields, subset=subset)

    return r
Yesukhei Jagvaral
  • 7
  • 24 Mar '21

Thank you for the response Dr. Nelson, but i think this above example still loads all the particles, correct?
Let me be more precise, i am trying to down sample the DM particles in order to use them as traces of the matter density field. So, it would be great for me if i could just randomly select and load 1 million DM particles out of the 6 billion in TNG100. Is there an easy way to do this?
Like for example with subhalos i would do:

rand_idx = np.choice(subhalos_num,1000) # where subhalo_num is the total number of subhalos in the snapshot
list_of_gal_pos = []
for i in rand_idx:
gal_pos= loadSingle(basePath, snapNum, i, 4, fields=['SubhaloPos']) # get the position of every galaxy within rand_idx and append it to the list
list_of_gal_pos.append(gal_pos)
Here, i would end up with 1000 randomly sampled galaxy positions. I want to do the same thing but with the DM particles. I hope this was clear.

Dylan Nelson
  • 24 Mar '21

Hi Yesukhei,

The particles are ordered according to halo, so in order to select "random" particles from across the entire simulation, you would need to randomly load across the snapshot. This won't really work as it is just small random reads.

The example I posted above should be used e.g.

nSubLoads = 100

for i in range(nSubLoads):
    data = loadSubset(..., chunkNum=i, totNumChunks=nSubLoads)
    # do something with data, which contains 1% of the particles

I would load all the data, no need to downsample. In such a loop you will never have more than 1% of the particles in memory at once, which should fit anywhere fine.

  • Page 1 of 1