Accidentally Quadratic

Accidentally Quadratic (with Lists). #

Certain computations can be conveniently be expressed as for each item do something. Sometimes it’s a trap. If it’s obvious that the complexity of the total calculation is linear (or N log(N)), then this conceptually easy approach can turn accidentally quadratic if the “something” doesn’t run in constant (or logarithmic) time. A typical example is concatenating strings as follows:

ret = ""
for s in strings:
  ret = ret + s
return ret;

Because ret slowly grows on each iteration, each concatenation is linear in the sum of ret and s. Therefore, if len(strings) is big and the longest string in strings is short, the operation is quadratic. Obviously, it can be done in linear time, if one preallocates sum(len(s) for s in strings) and then copies each s to its place in the output buffer.

It’s slightly vicious because:

ret = []
for s in strings:
  for c in s:
      ret.append(c)

is not (typically) accidentally quadratic. In Python it’s linear, because we can append to a list in O(1). In C++ if [] is an std::vector, it’s also linear because of how a vector grows by doubling it’s capacity.

The HighFive Variant. #

In HDF5 a dataset is a multi-dimensional array, e.g. size [n0, ..., nN]. Using regular hyperslabs, one can select regular sub-regions, i.e. the product of intervals, or more simply put “rectangles”. One can then select arbitrary parts of the dataset by computing the union of regular sub-regions. It’s not important consider that they could overlap.

We know that HDF5 stores the selection as a sorted list of regular hyperslabs. Which is great because if it’s sorted we can use bisection to find where we need to insert the new sub-region in O(log(N)) and because it’s a list we can insert in O(1).

The fallacy: lists don’t have random access. Hence, unless they support something like skip-lists (which HDF5 doesn’t), bisection doesn’t work and instead it’s O(N) to find the location to insert the item.

Overall, iteratively computing the union of regular hyperslabs takes O(N**2).

The Fix. #

The pragmatic way of avoiding the issue is by learning that HDF5 supports computing the union of two arbitrarily complex hyperslabs via H5Scombine_select. This opens up the path for divide and conquer.

We’re belabouring the point too much already, but here’s the commit that implements the change in HighFive.