Illustris - Simple way to greatly expand API usability

Zephyr Penoyre

17 Apr '17

tl;dr - A small change to the API, detailed in bold below, could turn it into a fully functional way to access all of Illustris data efficiently and effectively with no great time/memory costs. If the halo(/subhalo) catalogues were quick and easy to access then starting any project from there is a really useful general way to access the data and complete almost any project with only a laptop and wifi.

Hi all, I've been playing around with the API for a while, and I think there's some fantastic potential in there: being able to get hold of and analyse Illustris galaxies (and the simulation at large) without the need for huge memory storage or computing power.

That said, I find the current way of navigating the data rather confusing and difficult. I don't think JSON notation is actually that useful for these huge datasets, as there's both a) a lot of apples-to-oranges mixing in the data (one property of the object is a web address, another is a scalar quantity, another is a long list of subhalo indices, etc.) and b) difficulty in viewing the dataset as a whole (as we're given the data in browser-manageable chunks).

I think there's a huge amount of potential and cool stuff in there, and it shouldn't go anywhere, but it doesn't fit the workflow that I, and everyone I've asked about it, use with Illustris data:

1. Pull up the halo(/subhalo) catalogue in full.
2. Search for the 10, 100, or 10000 galaxies which have properties you're interested in.
3. Pull the snapshot data up for those galaxies individually.
4. Often, link this with merger tree data, or at least repeat over a number of snapshots (but let's not worry about that here).

The big issue here, stopping the API from being not just a fantastic tool but a simple one, and the only one needed to get the full benefit of Illustris data, is the relative difficulty and computing cost of retrieving the halo(/subhalo) catalogue.

E.g. for Illustris-1, the catalogue is in ~8 parts, each of which takes me about a minute to download, and which add up to a significant total file size. And yet once I have this data I'm most likely only searching three or four of the ~100 fields (not counting that some of these have 3, 6 or 8 values associated with them). All the time spent downloading the whole thing, and the space spent storing it, even if only temporarily, is wasted.

Once I have that data, and have picked the relevant galaxies, the API becomes a fantastically useful tool. It allows me to do everything I could wish for without ever needing more than a few seconds of download time and a handful of MB of temporary data.

So, I have a solution, but it requires a reshuffle of (or addition to) the way halo(/subhalo) data is stored and accessible online:

Don't store the halo(/subhalo) catalogue in HDF5 chunks, unusable on their own and unmanageable altogether

Store each field of the catalogues at its own address (e.g. all "GroupBHMass" is accessible on one page, and only one page)

Encourage users to follow a workflow where they pull up these (relatively manageable) slices of the catalogue, whichever cocktail suits them, then pick the relevant halos(/subhalos) and pull those out one by one from the API

This is both a lot more usable and intuitive than the query system and makes the vast majority of projects completely manageable (in time and memory) using just a laptop and an internet connection.
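
To put rough numbers on this, here is a minimal sketch comparing the size of a couple of per-field slices with the whole catalogue. The object count, the 4-byte value size, and the treatment of all ~100 fields as scalars are illustrative assumptions rather than measured Illustris values, and `slice_size_bytes` is a made-up helper name.

```python
def slice_size_bytes(n_objects, components_per_field, bytes_per_value=4):
    """Estimate the download size of a set of catalogue fields.

    components_per_field lists how many values each field stores per object
    (1 for a scalar, 3 for a vector, and so on).
    """
    return n_objects * sum(components_per_field) * bytes_per_value


# Hypothetical round number of objects, for illustration only.
n_objects = 1_000_000

# Two scalar fields, e.g. a gas mass and a stellar mass.
two_scalars = slice_size_bytes(n_objects, [1, 1])

# The full catalogue, approximated as ~100 scalar fields (an underestimate,
# since some fields hold 3, 6 or 8 values per object).
full_catalogue = slice_size_bytes(n_objects, [1] * 100)

ratio = two_scalars / full_catalogue
print(f"two fields: {two_scalars / 1e6:.0f} MB")
print(f"fraction of full catalogue: {ratio:.2%}")
```

Under these assumptions the two fields are a couple of percent of the full download, which is the saving the per-field layout is after.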

I'd love to help set this up, I'm no expert in APIs but am happy to help. This all stemmed from me trying to set up some simple packages to do this for the user, but without these alterations I've realised there's a major time and memory bottleneck. It also makes the workflow, to users new to the data, much more generalizable and understandable.

That said, I may well have missed half a hundred caveats, technical or philosophical, so let me know if this idea is flawed! best, Zephyr

Dylan Nelson

24 Apr '17

Hey Zephyr,

We've already discussed this a bit, but now just 2 questions.

1. You essentially want an HDF5 format return of one (or more) group catalog fields, in a single HDF5 file?
2. Can you give me an example of a search, or two, that cannot be achieved using the current search API?

Zephyr Penoyre

19 May '17

Hi Dylan, You could either implement an HDF5 file for each field in the group catalogues, or have them as a single HDF5 catalogue that you can download just a subsection of. As an example of a project that the current API is not well suited for, imagine I was looking for the angular momentum as a function of radius for galaxies with a low gas fraction (basically a random example, probably not that well motivated).

Ideally I could quickly and easily pull out the gas and stellar mass for all subhalos for a given snapshot, divide the former by the latter, record the subhalos and then loop through them making angular momentum profiles.

Right now the only way I could do this, to my knowledge, is to download the whole, multi-GB group cat, and read stellar and gas masses from that. But only a fraction of that group cat is relevant information. If I could access just stellar and gas mass for all subhalos without needing to download and store the whole group cat, I would only need a laptop, a few MB of storage space and an internet connection to identify, access and analyze every relevant galaxy.

And with a small library of simple commands, tasks like these become very easy, e.g.

    import numpy as np

    # getSubhalos and getSubhalo are the proposed library commands
    # pulls the relevant group cat fields from online
    gasMass, stellarMass = getSubhalos(whichSim, whichSnap, fields=['GasMass', 'StellarMass'])
    gasFrac = gasMass / stellarMass
    # gas poor milky way mass galaxies (each comparison needs its own brackets)
    whichGal = np.where((gasFrac < 0.01) & (stellarMass > 101.5) & (stellarMass < 102.5))[0]
    for gal in whichGal:
        # pulls the relevant particle data for this galaxy from online
        stellarPos, stellarMass, stellarVel = getSubhalo(whichSim, whichSnap, gal, fields=['stellarPos', 'stellarMass', 'stellarVel'])
        findAngMom(stellarPos, stellarVel, stellarMass)  # actual function that finds the profile
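
The selection step above can be tried offline with small stand-in arrays in place of the downloaded fields. This is only a sketch: `select_gas_poor` is a hypothetical name, the arrays are made up, and the thresholds are copied as written in the post. It also assumes subhalos with zero stellar mass should simply be excluded.

```python
import numpy as np


def select_gas_poor(gas_mass, stellar_mass, max_gas_frac=0.01,
                    min_stellar=101.5, max_stellar=102.5):
    """Return indices of subhalos passing the gas-fraction and mass cuts."""
    gas_mass = np.asarray(gas_mass, dtype=float)
    stellar_mass = np.asarray(stellar_mass, dtype=float)

    # Subhalos without stars get an infinite gas fraction, so they never
    # pass the cut and no division by zero takes place.
    gas_frac = np.full(stellar_mass.shape, np.inf)
    has_stars = stellar_mass > 0
    gas_frac[has_stars] = gas_mass[has_stars] / stellar_mass[has_stars]

    # Each comparison is bracketed because & binds tighter than < and >.
    mask = ((gas_frac < max_gas_frac)
            & (stellar_mass > min_stellar)
            & (stellar_mass < max_stellar))
    return np.flatnonzero(mask)


# Made-up stand-in values, not real catalogue data.
gas = np.array([0.5, 50.0, 0.0, 0.2, 1.0])
stars = np.array([102.0, 102.0, 0.0, 90.0, 101.9])

selected = select_gas_poor(gas, stars)
print(selected)
```

Only the first and last entries are both gas poor and inside the mass window, so those are the indices a loop over individual galaxies would then visit.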

p.s. hope this makes sense, if not let's find a time to talk in person because I think we're going back and forth a lot here.

Dylan Nelson

2 Jun '17

Hi Zephyr,

I've implemented this functionality:

2 June, 2017: New functionality has been added to the web API. Individual fields can now be downloaded from group catalogs, without needing to download the entire catalog. See the API reference documentation under [base]/groupcat-{num}/?{subset_query}. For example, this link downloads just the Illustris-1 z=0 subhalo SFRs.

I hope you find this useful, let me know if you have any questions (or encounter any problems).
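
As a rough sketch of how such a request might look from Python, the snippet below builds a URL following the `[base]/groupcat-{num}/?{subset_query}` pattern and streams the response to disk. The base URL, the `Subhalo=SubhaloSFR` query form, snapshot number 135 for z=0, and the `api-key` header name are assumptions here and should be checked against the API reference; `groupcat_field_url` and `download_field` are made-up helper names.

```python
import shutil
import urllib.parse
import urllib.request

# Assumed value of [base] for Illustris-1; check the API reference.
BASE = "http://www.illustris-project.org/api/Illustris-1/files"


def groupcat_field_url(base, num, object_type, field):
    """Build [base]/groupcat-{num}/?{subset_query} for a single field."""
    subset_query = urllib.parse.urlencode({object_type: field})
    return f"{base.rstrip('/')}/groupcat-{num}/?{subset_query}"


def download_field(url, api_key, out_path):
    """Stream the response for url into out_path and return the path.

    The "api-key" header name is an assumption about how the API
    authenticates requests.
    """
    request = urllib.request.Request(url, headers={"api-key": api_key})
    with urllib.request.urlopen(request) as response, open(out_path, "wb") as out:
        shutil.copyfileobj(response, out)
    return out_path


# Only the URL is built here; nothing is downloaded until download_field runs.
url = groupcat_field_url(BASE, 135, "Subhalo", "SubhaloSFR")
print(url)
```

Calling `download_field(url, "your-key", "sfr.hdf5")` would then save just that field, after which the file can be opened with any HDF5 reader.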

Zephyr Penoyre

21 Jul '17

This is fantastic, works really well, thank you!
