Merger catalogs with merger and progenitor information

Rebecca Nevin
  • 5 Oct '20

Here's the situation - I'm on the hunt for mergers in TNG50/TNG100.

However, I'm finding that the organization of the merger trees is not ideal for my specific situation; I believe what I'd like is more of a horizontal organization (across all subhalos at a given snapshot, rather than vertically in time along a tree). So I'm thinking that making my own catalogs is the best way forward, but I'd like a second opinion on this.

The definition of a merger from Vicente's 2015 SubLink paper is where a subhalo has more than one direct progenitor, i.e., a next progenitor to its first progenitor exists. I'd like to be able to select a snapshot, search across all subhalos/galaxies within a certain range of masses, and select those that have more than one direct progenitor. I'll define these as mergers, and I'll build a catalog of the IDs and mass ratios of these mergers as well as the identifiers of the direct progenitors. I'll repeat this for various redshift bins, so I need to be able to search horizontally across all subhalos that meet certain requirements and create catalogs of mergers this way. I'm imagining these catalogs to be, e.g., 'all major mergers between z = 2 and 3 in TNG50 where the stellar mass is within a given range', where each entry of the catalog gives the ID, snapnum, stellar mass, progenitor IDs and snapnums, and mass ratio of the merger.
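
For concreteness, here is a minimal sketch of the per-subhalo check I have in mind, using illustris_python (the basePath, snapshot number, and subhalo ID below are just placeholders):

import illustris_python as il

basePath = '/path/to/TNG50-1/output'   # placeholder: local copy of the simulation output
snapNum, subhaloID = 67, 12345         # placeholder snapshot and subhalo

# fields needed to test the 'more than one direct progenitor' condition
fields = ['SubhaloID', 'FirstProgenitorID', 'NextProgenitorID',
          'SnapNum', 'SubfindID', 'SubhaloMassType']
tree = il.sublink.loadTree(basePath, snapNum, subhaloID, fields=fields)

# tree pointers are tree-wide SubhaloIDs; within a single tree (depth-first ordering),
# the row of a pointer is (pointer ID - SubhaloID of the root)
rootID = tree['SubhaloID'][0]
fpID   = tree['FirstProgenitorID'][0]
isMerger = False
if fpID != -1:
    fpRow = fpID - rootID
    isMerger = (tree['NextProgenitorID'][fpRow] != -1)   # first progenitor has a sibling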

I'm aware of the following resources that might be useful to create these catalogs:
1) SubLink trees. However, this seems like far too much information, because I'd really just like to go back a few snapshots in time; I don't need the full history of a given subhalo across all of cosmic time. Is there a way to vertically trim trees? Currently it's taking far too long to load even a single tree for a given subhalo, and I'll need to do this for a large number of subhalos.

2) The discussion with Zephyr Penyore about the simple.json merger catalogs (filed under "Streamlined access to API data"). I've read through this discussion, found it helpful, and spoken with Zephyr about this data product. However, I would like to be able to restrict this catalog by various things like redshift range, mass ratio, and stellar mass. One path forward could be to rewrite this code for the TNG runs (it looks like the code currently targets original Illustris), but this would still involve dealing with the full merger trees, which causes the same slowdown as #1.

3) The 'subhalo catalogs', as I'm calling them (sub = get( subs['results'][1]['url'] ); I'm not sure of the actual name), have the ['related'] field, which gives the progenitor ID ('prog_sfid') and would be useful for querying the tree. I wanted to check that I'm not missing some sort of info about a next progenitor ID, because that could be a useful tool for directly using the subhalo catalogs to search each subhalo for its progenitors. One solution to my dilemma would be to somehow add a 'nextprog_sfid' to each subhalo. I also keep getting confused about the next progenitor: given my understanding of the merger trees, it would be added to the first progenitor's subhalo entry. Given an observational approach to mergers, however, it could be useful added to the descendant subhalo's entry, but it should then probably be renamed something other than next progenitor, e.g., 'the thing that merged with the first progenitor to form this subhalo.'
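
To illustrate what I mean by working directly off the web API, here is a rough sketch (the URL, the 'related' key name, and the returned field names follow my reading of the API docs and should be double-checked):

import requests

headers = {'api-key': 'your-api-key-here'}   # placeholder; obtain a key from the TNG website

def get(path, params=None):
    # minimal JSON-fetching helper in the style of the API cookbook examples
    r = requests.get(path, params=params, headers=headers)
    r.raise_for_status()
    return r.json()

# fetch one subhalo record, then step back along its SubLink main progenitor branch
sub = get('https://www.tng-project.org/api/TNG50-1/snapshots/67/subhalos/12345/')
prog_url = sub['related']['sublink_progenitor']   # None if no progenitor exists
if prog_url is not None:
    prog = get(prog_url)
    print(prog['snap'], prog['id'])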

I was hoping to get your opinion on which of these options is the most promising path forward, and I'd also like to avoid reinventing the wheel. I would of course be happy to contribute to a new merger-focused value added catalog for TNG.

Finally, I'm aware that the definition of a merger is a little tricky, since subhalo particles can be lost during a merger. This means the moment of merging should probably be defined as the moment of maximum mass in a halo's mass history. I'm also aware that I've only been considering binary mergers so far. I would be happy to discuss either of these considerations.
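
(As an aside, if I'm reading illustris_python's sublink.py correctly, it already ships a helper along these lines, numMergers, which counts mergers using each progenitor's maximum past stellar mass; something like the sketch below, with the same placeholder basePath/snapshot/subhalo as above:)

import illustris_python as il

basePath, snapNum, subhaloID = '/path/to/TNG50-1/output', 67, 12345   # placeholders

# fields that numMergers appears to require
fields = ['SubhaloID', 'NextProgenitorID', 'MainLeafProgenitorID',
          'FirstProgenitorID', 'SubhaloMassType', 'SnapNum']
tree = il.sublink.loadTree(basePath, snapNum, subhaloID, fields=fields)

# number of mergers along the main branch with stellar mass ratio above 1:4
nMajor = il.sublink.numMergers(tree, minMassRatio=1.0/4.0)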

Dylan Nelson
  • 6 Oct '20

Hi Rebecca,

It sounds like you want a "catalog of galaxy-galaxy mergers" (containing similar data as (2), but for all subhalos). This can be derived, under many different possible definitions, from the merger trees. But you're right, there isn't such a catalog already easily available. As for (3), this does not exist; I have only included the direct (first) progenitor/descendant links in the API fields.

So I think the best approach would be (1), i.e. deriving a galaxy merger catalog from the merger trees. You don't have to load trees one at a time; for instance, I sometimes use a function like the one below, which can load the trees of all subhalos at once. This might take a minute or two, but after that you can process everything in memory and derive a complete merger catalog very quickly.

import h5py
import numpy as np
import illustris_python as il

def loadMPBs(sP, ids, fields=None, treeName='SubLink'):
    """ Load multiple MPBs at once (e.g. all of them), optimized for speed, with a full tree load (high mem).
    Basically a rewrite of illustris_python/sublink.py under specific conditions (hopefully temporary).
      Return: a dictionary whose keys are subhalo IDs, and the contents of each dict value is another
      dictionary of identical structure to the return of loadMPB().
    """
    from glob import glob
    assert treeName in ['SubLink','SubLink_gal'] # otherwise need to generalize tree loading

    # make sure fields is a list (a single string or None is also accepted)
    if fields is None:
        fields = []
    if isinstance(fields, str):
        fields = [fields]

    # MainLeafProgenitorID is always needed to delineate the MPB of each subhalo
    fieldsLoad = fields + ['MainLeafProgenitorID']

    # find full tree data sizes and attributes
    numTreeFiles = len(glob(il.sublink.treePath(sP.simPath,treeName,'*')))

    lengths = {}
    dtypes = {}
    seconddims = {}

    for field in fieldsLoad:
        lengths[field] = 0
        seconddims[field] = 0

    for i in range(numTreeFiles):
        with h5py.File(il.sublink.treePath(sP.simPath,treeName,i),'r') as f:
            for field in fieldsLoad:
                dtypes[field] = f[field].dtype
                lengths[field] += f[field].shape[0]
                if len(f[field].shape) > 1:
                    seconddims[field] = f[field].shape[1]

    # allocate for a full load
    fulltree = {}

    for field in fieldsLoad:
        if seconddims[field] == 0:
            fulltree[field] = np.zeros( lengths[field], dtype=dtypes[field] )
        else:
            fulltree[field] = np.zeros( (lengths[field],seconddims[field]), dtype=dtypes[field] )

    # load full tree
    offset = 0

    for i in range(numTreeFiles):
        with h5py.File(il.sublink.treePath(sP.simPath,treeName,i),'r') as f:
            for field in fieldsLoad:
                if seconddims[field] == 0:
                    fulltree[field][offset : offset + f[field].shape[0]] = f[field][()]
                else:
                    fulltree[field][offset : offset + f[field].shape[0],:] = f[field][()]
            # all tree fields share the same first dimension, so any field gives this file's row count
            offset += f[fieldsLoad[0]].shape[0]

    result = {}

    # (Step 1) treeOffsets()
    offsetFile = il.groupcat.offsetPath(sP.simPath,sP.snap)
    prefix = 'Subhalo/' + treeName + '/'

    with h5py.File(offsetFile,'r') as f:
        # load all merger tree offsets
        if prefix+'RowNum' not in f:
            return result # early snapshots, no tree offset

        RowNums     = f[prefix+'RowNum'][()]
        SubhaloIDs  = f[prefix+'SubhaloID'][()]

    # now subhalos one at a time (memory operations only)
    for i, id in enumerate(ids):
        if id == -1:
            continue # skip requests for e.g. fof halos which had no central subhalo

        # (Step 2) loadTree()
        RowNum = RowNums[id]
        SubhaloID  = SubhaloIDs[id]

        if RowNum == -1:
            continue # subhalo not in the tree

        MainLeafProgenitorID = fulltree['MainLeafProgenitorID'][RowNum]

        # load only main progenitor branch
        rowStart = RowNum
        rowEnd   = RowNum + (MainLeafProgenitorID - SubhaloID)
        nRows    = rowEnd - rowStart + 1

        # init dict
        result[id] = {'count':nRows}

        # loop over each requested field and copy, no error checking
        for field in fields:
            result[id][field] = fulltree[field][RowNum:RowNum+nRows]

    return result
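
A minimal usage sketch (here sP is assumed to be any small object carrying the simPath and snap attributes that the function reads; the path and selection are placeholders):

import types
import numpy as np

basePath = '/path/to/TNG50-1/output'                    # placeholder
sP = types.SimpleNamespace(simPath=basePath, snap=21)   # e.g. snapshot 21 (~z=4 in TNG)

ids = np.arange(1000)                                   # or any selection of subhalo IDs
mpbs = loadMPBs(sP, ids, fields=['SnapNum', 'SubfindID', 'SubhaloMassType'])

# each entry holds the MPB of that subhalo, ordered from the latest snapshot backwards in time
for sub_id in list(mpbs.keys())[:3]:
    print(sub_id, mpbs[sub_id]['count'], mpbs[sub_id]['SnapNum'][:5])
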
Rebecca Nevin
  • 8 Oct '20

Hi Dylan, this looks great and I'm incorporating it into my code, thanks for the quick response!

I have a couple of questions that have come up as I do this. First, is there a 'loadMPB' function in the sublink.py code? I want to be able to compare this new function to the original and see how it works in the context of the offsets and group catalogs.

Second, and this is probably related to my first question, I'm now getting confused about the 'ids' variable that is the input to loadMPBs. Are these subfind IDs or subhalo IDs? Right now I'm using the group catalogs to get all subhalos at redshift 4, and I'd like to work off of the variable 'SubhaloIDMostbound' (from the group catalogs) to trace the merger trees back in time and see whether these subhalos have recently merged. Typically, the order seems to be: 1) get IDs from the group catalogs, 2) use offsets to get the subfind IDs that are used in the trees, 3) walk along the trees.

It seems to me that the loadMPBs function is step #3 above, so I would also need to figure out how to handle the offsets. I want to make sure that I'm understanding this process correctly, and seeing it work in context would probably help, which relates back to question #1 above.

Thanks again, I know I've read about the difference between IDs in the documentation, but I keep getting turned around.

Dylan Nelson
  • 9 Oct '20

Hi Rebecca,

The sublink.py file contains the single-tree loading function (loadTree), which will load only the MPB if onlyMPB=True.
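
For reference, that call looks roughly like this (the basePath, snapshot, and subhalo ID are placeholders):

import illustris_python as il

basePath = '/path/to/TNG50-1/output'   # placeholder
tree = il.sublink.loadTree(basePath, 21, 12345,
                           fields=['SnapNum', 'SubfindID', 'SubhaloMassType'],
                           onlyMPB=True)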

"Subhalo ID" and "Subfind ID" are just two names for the same thing.

Note that SubhaloIDMostbound is the ID of an individual particle (the most bound particle of the subhalo), not a subhalo index.

A process would be: (1) load the group catalogs at redshift 4 and select the subhalo IDs of interest, (2) load all the trees of these IDs with the above function, (3) derive some quantity of interest from the tree of each subhalo.
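
Sketching those three steps (the stellar mass cut, h value, and snapshot number are just examples; catalog masses are in units of 1e10 Msun/h):

import types
import numpy as np
import illustris_python as il

basePath = '/path/to/TNG50-1/output'   # placeholder
snap = 21                              # e.g. ~z=4 in TNG (check the snapshot table)

# (1) select subhalos of interest from the group catalog, e.g. by stellar mass
mstar = il.groupcat.loadSubhalos(basePath, snap, fields=['SubhaloMassType'])[:, 4]
mstar = mstar * 1e10 / 0.6774          # to Msun, assuming the TNG value of h
ids = np.where((mstar > 1e9) & (mstar < 1e11))[0]

# (2) load the main progenitor branches of all selected subhalos in one pass
sP = types.SimpleNamespace(simPath=basePath, snap=snap)
mpbs = loadMPBs(sP, ids, fields=['SnapNum', 'SubfindID', 'SubhaloMassType'])

# (3) derive a quantity of interest from each branch, e.g. the stellar mass history
for sub_id in ids[:3]:
    if sub_id in mpbs:
        hist = mpbs[sub_id]['SubhaloMassType'][:, 4] * 1e10 / 0.6774
        print(sub_id, hist[:5])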
