GDAL meets EDINA

Martin Daly has started posting on A Higher Education with details about a use case of GDAL for serving large datasets over the Web:

“We use GDAL to read the files, and were opening them via GDALOpenShared, so that GDAL only opened the file once and used reference counting to manage the lifetime of the GDALDataset object. Unfortunately (for us) GDAL is not thread safe. This isn’t a criticism, the fault is entirely ours for using it in this way.”

Criticism or not, the reality is that we (software developers) have already jumped into an era of parallelism (count the number of physical or logical CPUs in your computer) where thread-safety becomes a minimum requirement, as basic as avoiding buffer overruns.

5 thoughts on “GDAL meets EDINA”

  1. I don’t necessarily agree that GDAL should be entirely thread-safe. This is a topic where the pros and cons must be properly weighed. Thread safety has a cost that is not negligible in use cases where people don’t need it. IMHO, the true target is to achieve proper re-entrancy.

    And, in read-only scenarios, GDAL can be safely used in a multithreaded context, as the GDAL core and the drivers for the most popular formats are re-entrant (the note on the wiki sounds a bit too pessimistic with respect to the current situation).

    To achieve that, I see 2 possibilities:
    * either you use a distinct dataset handle for each thread. The obvious drawback is that you can’t benefit from potentially shared cached blocks.
    * or you use the same dataset handle under explicit locking. That is in fact what a thread-safe GDAL API would do: for file-based formats, you must be sure that one thread will not mess with the file pointer used by another thread, so you would necessarily end up serializing access to the dataset, which makes the benefit of sharing the dataset less obvious…
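
    The two options above can be sketched without GDAL itself. The following is only an illustration under stated assumptions: FakeDataset, SumWithHandlePerThread and SumWithSharedHandle are hypothetical names, and the "seek then read" member stands in for a driver that moves a file pointer. It is not GDAL code.

    ```cpp
    #include <cassert>
    #include <cstddef>
    #include <mutex>
    #include <thread>
    #include <utility>
    #include <vector>

    // Hypothetical stand-in for a file-backed dataset: a read first moves a
    // cursor (the "file pointer") and then reads at that position.
    class FakeDataset {
    public:
        explicit FakeDataset(std::vector<int> pixels) : pixels_(std::move(pixels)) {}

        int ReadPixel(std::size_t offset) {
            cursor_ = offset;            // "seek"
            return pixels_.at(cursor_);  // "read" at the current position
        }

    private:
        std::vector<int> pixels_;
        std::size_t cursor_ = 0;
    };

    // Option 1: every thread opens its own handle, so nothing is shared
    // (and nothing cached by one handle helps another).
    long SumWithHandlePerThread(const std::vector<int>& pixels, unsigned nThreads) {
        assert(nThreads > 0);
        std::vector<long> partial(nThreads, 0);
        std::vector<std::thread> workers;
        for (unsigned t = 0; t < nThreads; ++t) {
            workers.emplace_back([&, t] {
                FakeDataset own(pixels);  // private handle for this thread
                for (std::size_t i = t; i < pixels.size(); i += nThreads)
                    partial[t] += own.ReadPixel(i);
            });
        }
        for (auto& w : workers) w.join();
        long total = 0;
        for (long p : partial) total += p;
        return total;
    }

    // Option 2: one shared handle, with every read serialized by a mutex.
    long SumWithSharedHandle(const std::vector<int>& pixels, unsigned nThreads) {
        assert(nThreads > 0);
        FakeDataset shared(pixels);
        std::mutex ioMutex;
        std::vector<long> partial(nThreads, 0);
        std::vector<std::thread> workers;
        for (unsigned t = 0; t < nThreads; ++t) {
            workers.emplace_back([&, t] {
                for (std::size_t i = t; i < pixels.size(); i += nThreads) {
                    std::lock_guard<std::mutex> lock(ioMutex);  // seek+read as one step
                    partial[t] += shared.ReadPixel(i);
                }
            });
        }
        for (auto& w : workers) w.join();
        long total = 0;
        for (long p : partial) total += p;
        return total;
    }
    ```

    Both functions return the same sum for the same input; the difference is that the second one never has two reads in flight at once, which is the serialization cost described above.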

    I’d also note that the documentation of GDALOpenShared() stresses that the returned handle cannot be used by different threads at the same time. In case people wonder what the point of GDALOpenShared() is, I’d mention the VRT driver, where the same underlying dataset can be referenced by the sources of different bands.

    The whole subject would certainly be interesting to discuss on gdal-dev ;-) By the way, a recently proposed RFC (http://trac.osgeo.org/gdal/wiki/rfc26_blockcache) discusses thread-safety considerations in the GDAL global block cache.

  2. Even,

    Re-entrancy, as the stronger requirement, implies thread-safety. However, thread-safety does not necessarily mean re-entrancy. So how do you intend to provide re-entrancy while dropping thread-safety at the same time? Perhaps you are mixing in the concept of parallelism: GDAL performing multiple tasks in parallel internally, as a black box.

    Also, it is feasible to provide a mechanism to disable thread-safety features at compile time. As with a well-written parallel implementation of an algorithm, if it’s well designed, it should compile and execute in single-threaded mode as well as in multi-threaded mode.
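
    One way such a compile-time switch could look is sketched below. This is an assumption for illustration only: SKETCH_NO_THREADS, NullMutex, SketchMutex and BlockCounter are hypothetical names, not GDAL build options or types.

    ```cpp
    #include <mutex>

    // Hypothetical build switch: define SKETCH_NO_THREADS to compile the
    // locking away entirely for single-threaded builds.
    #ifdef SKETCH_NO_THREADS
    struct NullMutex {
        void lock() {}    // no-op: nothing to protect against
        void unlock() {}
    };
    using SketchMutex = NullMutex;
    #else
    using SketchMutex = std::mutex;
    #endif

    // The class is written once; only the mutex type changes between builds.
    class BlockCounter {
    public:
        void Add(int n) {
            std::lock_guard<SketchMutex> lock(mutex_);
            total_ += n;
        }

        int Total() {
            std::lock_guard<SketchMutex> lock(mutex_);
            return total_;
        }

    private:
        SketchMutex mutex_;
        int total_ = 0;
    };
    ```

    std::lock_guard only needs lock() and unlock(), so the same source compiles in both modes and the single-threaded build pays nothing for the guard.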

    Regarding the cost of thread safety, in my opinion it would hardly be comparable to the general cost of the operations performed by GDAL. GDAL does hard work that itself costs a lot. This assumes, of course, that there are no locks per pixel or per short scanline, etc.

    Another issue is that making GDAL re-entrant will most likely require substantial alteration of the GDAL public API. Given the GDAL development process, I can’t really imagine it happening in the near future. There are many show stoppers on the horizon :-)

    Regarding reading, the note in the FAQ says “as long as no two threads access the same GDALDataset object at the same time”. If it is more pessimistic than the actual state of the implementation, it would be good if someone knowledgeable could update it.

    I’m no longer subscribed to gdal-dev, so I’ll take the liberty to continue posting here from time to time.

    By the way, in case it’s relevant, a simple example of well-designed algorithms for raster processing can be found in a small two-header extension to the Adobe/Boost GIL library called gil threaded. The idea of an algorithmic approach to thread-safety is interesting; from the readme: alg(x + y) = alg(x) + alg(y)
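
    The property alg(x + y) = alg(x) + alg(y) can be illustrated with a pixel sum, reading "+" on the left as joining two parts of a raster. This is a minimal sketch under that assumption and does not reproduce the gil threaded code; SumRange and SumInTwoThreads are hypothetical names.

    ```cpp
    #include <cstddef>
    #include <numeric>
    #include <thread>
    #include <vector>

    // alg: sum of the pixel values in the half-open range [begin, end).
    long SumRange(const std::vector<int>& px, std::size_t begin, std::size_t end) {
        return std::accumulate(px.begin() + static_cast<std::ptrdiff_t>(begin),
                               px.begin() + static_cast<std::ptrdiff_t>(end), 0L);
    }

    // alg(x) and alg(y) run on separate threads over disjoint halves, so no
    // locking is needed; the partial results are combined after both join.
    long SumInTwoThreads(const std::vector<int>& px) {
        const std::size_t mid = px.size() / 2;
        long left = 0;
        long right = 0;
        std::thread a([&] { left = SumRange(px, 0, mid); });
        std::thread b([&] { right = SumRange(px, mid, px.size()); });
        a.join();
        b.join();
        return left + right;
    }
    ```

    Because each thread only reads its own half and writes its own result variable, the threaded version gives the same answer as a single call to SumRange over the whole buffer.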

    Thanks for sharing your thoughts Even!

    Mat

  3. About re-entrancy and thread-safety definitions, this is becoming funny…

    I think I’ve been consistent with the definitions given in http://trac.osgeo.org/gdal/wiki/rfc16_ogr_reentrancy, which are pretty close to those in http://qt.nokia.com/doc/4.6/threads-reentrancy.html, which also says:
    “Hence, a thread-safe function is always reentrant, but a reentrant function is not always thread-safe.”

    But the Wikipedia page (http://en.wikipedia.org/wiki/Reentrant_(subroutine)) that deals with re-entrancy seems to agree with your definitions: “Every reentrant function is thread-safe; however, not every thread-safe function is reentrant”.

    So, as often, words are the source of confusion ;-)

    Let the (pseudo)code speak for me. What I mean is you can (currently) do :
    - Thread T1 : GDALRasterIO(hDataset1, GF_Read, …)
    - Thread T2 : GDALRasterIO(hDataset2, GF_Read, …)

    But you cannot (generally (*)) do (without external lock) :
    - Thread T1 : GDALRasterIO(hDataset, GF_Read, …)
    - Thread T2 : GDALRasterIO(hDataset, GF_Read, …)

    (*) : this would work if the pixel window that is read is already in the block cache. In this scenario, the IReadBlock() method of the underlying driver does not need to be called and as the relevant part in GDAL core is thread-safe, this would work. Of course, this is not very practical to check.

  4. It’s getting clearer. Qt admits that it uses the terms quite differently from how they are understood in C or POSIX:

    “POSIX uses a somewhat different definition of reentrancy and thread-safety for its C APIs”

    In any case, reentrancy is a different animal from thread-safety. According to the traditional understanding of both in the Unix world, reentrancy implies thread-safety.

    GDALRasterIO would be thread-safe if the code of the function could be safely executed by multiple, meaning concurrent, threads.
    GDALRasterIO would be reentrant if it were safe to start another execution of GDALRasterIO while a previous one was interrupted. In the latter case there are no implications or assumptions about the number of threads executing it.

    GDALRasterIO can be made thread-safe by using one of the locking mechanisms internally, but this will not necessarily make the function reentrant.
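
    The distinction can be made concrete with a small sketch. It is an illustration only: GuardedReader is a hypothetical type, not part of GDAL, and it uses a try-lock style flag so that a nested call is reported instead of hanging, which is what a blocking non-recursive lock would do.

    ```cpp
    #include <atomic>
    #include <functional>

    // Hypothetical reader. Only one call may be inside Read() at a time,
    // which protects its state from concurrent threads.
    class GuardedReader {
    public:
        // Runs onBlock in the middle of a read, the way a callback or an
        // interruption might. Returns false if a read was already in progress.
        bool Read(const std::function<void()>& onBlock) {
            if (busy_.exchange(true)) {
                return false;  // a blocking lock would wait forever on a nested call
            }
            if (onBlock) {
                onBlock();
            }
            busy_.store(false);
            return true;
        }

    private:
        std::atomic<bool> busy_{false};
    };
    ```

    A call made from inside onBlock finds the flag already set and is refused, even though everything happens on a single thread: the guard that keeps other threads out is exactly what prevents re-entry.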

    In your pseudo-code examples, the first case requires reentrancy but not thread-safety. The second example, IMHO, requires both, because hDataset is not itself a thread-safe object.

  5. Pingback: GIS-Lab Blog
