Two months ago I started working at Marinexplore, trying to simplify the process of accessing public ocean data. At Marinexplore we aggregate data from some 25k+ plataforms that are measuring temperature, salinity, oxygen, chlorophyll and other parameters at the ocean. We also aggregate satellite data and some derivative products, all in a simple and beautiful web interface that allows you to explore and visualize the available data, and prepare a customized dataset with what you need.
My task here is simple: make that data easily consumable. My initial work was implementing Opendap access to the prepared datasets, so that users can access the data directly instead of having to download, unzip, import and parse it. Opendap is a protocol for accessing "scientific" data via HTTP requests — I use "scientific" in quotes because it can be used for any kind of data, but the use is mostly restricted to the oceanographic, meteorological and climate communites.
And that's it. The .dods format is simply a big endian encoding of the data with a header. Pydap maps the data from an Opendap server into objects that behave pretty much like Numpy arrays (and record arrays), except that data is lazily downloaded from the server when it is actually needed, instead of loading everything when the object is created. (Basically it maps __getitem__ and __iter__ calls to HTTP requests to the server.)
But even though the protocol is simple, it has one problem, IMHO. The Opendap data model is too generic, so usually clients will implement only part of the data model, or different conventions of how the data is represented. Building an Opendap dataset that can be accessed by the most common clients is quite complicated, because clients expect different conventions, or issue different requests, or have different assumptions on the data. This is particularly true if the dataset mixes gridded and in-situ data, which is usually the case in Marinexplore prepared datasets.
Since I've been testing different Opendap clients I decided to write down the quirks that I found, and a small tutorial on creating compatible datasets. Most of the problems are related to in-situ data, and GrADS is by far the most troublesome client.
How do I create a compatible dataset?
First, I assume that your dataset has in-situ data, contained in an Opendap "Sequence" object. If you only have gridded data it's pretty easy to have it accessed by all Opendap clients out there. Just stick to the CF conventions, label your axes with the "axis" attribute, use a COARDS units for your time axis (like "jiffies since 1900-1-1") and you should be fine. There are still a few quirks that you need to have in mind:
- Ferret does not like an empty PATH_INFO, so you can't have a dataset at http://example.com/. It must be http://example.com/dataset.
- GrADS can't handle more than one grid definition, so avoid mixing Aquarius data (low resolution) with GHRSST (very high resolution). If you do that, GrADS will only be able to read the variables defined on the first grid that it encounters. Other variables will fail to load silently.
- Your URL must end in ".nc", because otherwise ODV will fail to open the dataset. Even though Opendap was created exactly for the reason that the format of the data on the server shouldn't matter for the client. So make your dataset URL http://example.com/dataset.nc. In Marinexplore I changed the Opendap server so that anything can be appended to the dataset URL, and I just inform ODV users to append ".nc".
- Related to the previous problem, if you want to visualize your data using IDV you need to use the "dods://" protocol instead of "http://". Why? DODS was the original name of the Opendap protocol. So inform IDV users that they actually should open the URL dods://example.com/dataset.nc.
That was the easy part. Now let's add in-situ data to the dataset.
- The first thing you'll want to do is rename all the variables in the gridded datasets. If you have variables with the same name both in the in-situ and gridded data — something that is perfectly fine in Opendap — GrADS will fail to load them, because it requests variables by name ("salinity", eg) instead of id ("insitu.salinity"). In Marinexplore I simply added the prefix "g_" to gridded variables, so we have datasets with "g_ocean_temperature" and "ocean_temperature". Ugly, I know.
- The next thing you'll want to do is make sure that in-situ data is always returned in the same order. This is required because of Ferret. If you plot, eg, salinity versus latitude/longitude Ferret will first request latitude, then longitude, then salinity, instead of requesting them combined. In our database these requests would return in different orders, which is also perfectly fine according to the Opendap specification. This means I had to add an ORDER BY statement, which slowed the query. And even then there's no guarantee that the data won't change between requests. When reading data from a CSV file instead of the database this also means three passes on the CSV file instead of just one.
- You MUST have an id variable, that could correspond to the station id, but can in fact be anything. It MUST be a string, and it MUST be called "stid". Otherwise GrADS requests will fail without warning. It took me a full day reverse-engineering a GrADS Data Server to figure that this was the problem with my server.
That's also the easy part. Now to the complicated part. Remember how I said that you can specify conditions on the URL when requesting data? For example, if we wanted data for December 2012 in the southern hemisphere we would do a request somewhat like this:
Well, GrADS doesn't like that syntax. Instead of doing the request above, GrADS would issue 31 separate requests like these:
To be fair, the syntax is still valid Opendap — the protocol syntax supports function calls, although they are rarely used. But why complicate it with an unnecessary function? This means that every Opendap server must implement this if they want GrADS users accessing their data. In earlier version of Pydap the support for server-side functions like this is optional, but it will come by default on the new 3.2 branch that should be out soon.
There's still one more thing that needs to be done. If you look at those requests you'll notice that it's asking for data between 00:00 and 00:00 of the same day. This means that practically no data will be returned, since this should give only results that where measured exactly at 00:00. To solve this the server has to check for an attribute called "grads_step" in the dataset, which defines the granularity of the time requests. This will be something like "1dy" or "1mo" or "15mn". Your server should parse that value and add it to initial value of the time request, so that you actually return the data that the client is expecting.
Once you do all that, you should be able to access the dataset with Ferret, GrADS, etc. — and of course, Pydap. Pydap will accept anything, because while writing my code I always followed Postel's law: "Be conservative in what you send, liberal in what you accept". This was one thing I quickly learned while working with a client/server protocol.
There's also a few GrADS quirks that are not really important but worth mentioning. GrADS has separate commands for opening gridded and in-situ datasets, so in order to load a dataset with both types of data you must issue two commands: "open http://example.com/dataset.nc" and "sdfopen http://example.com/dataset.nc". In this case the in-situ data will be on file 1, and the gridded dataset, on file 2. Also, when opening the gridded dataset using sdfopen GrADS will request the in-situ data for the latitude axis ("insitu.latitude"), which can take a long time depending on the dataset. So, even though you're requesting gridded data, GrADS will say "hey, there's also this in-situ data here, I guess I'll request one variable from it just for fun!".