In Praise of Small Tools, or a short Ode to the CommandLine
Given the huge scope of geological problems, it is not reasonable to assume that all of them can be solved with large, monolithic solutions. The requirements of a geochemist will differ from those of a structural geologist, so they will work with different computational approaches. Even within a given field, different problems need solving: your monolithic tool that handles gravity models does not work with magnetic fields, or you have a weird edge case that needs an additional preprocessing step that cannot be accommodated easily. Frequently, this extra capability can be bolted onto a monolithic tool through some plugin system or other, which may or may not require an additional payment for each plugin. There must be a more sensible way, surely?
For those not familiar with the Unix/Linux environment, there is a design philosophy that favours small, single-use tools. An important part of this is that the various tools need to be able to feed the data that they generate into the next tool, to create an entire workflow. An example of this would be:
cat data.csv | grep -i 'sio2' > silica.csv

This prints the entirety of data.csv, filters it for any lines containing 'sio2' (ignoring case), and writes the matching lines to a new file named silica.csv.
My perfect geocomputing world would be arranged similarly. Imagine starting with your data file and running it through a series of small programs, each dealing with one part of the problem: removing poor data, selecting the data points, running the analyses, and creating your data visualisation. If you need to change one aspect of this, the rest of your workflow can be left intact. This can certainly be done with various graphical interfaces, but it is easier with small tools once you have done it once, provided you have a way to record or script what you have done: you can go back, look at what you did, and tweak things if you need to. So if your selection of data points needs to be expanded or reduced, in a toolchain environment this is done by simply altering the parameters of the small program that selects the data.
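As a sketch of what such a workflow might look like on the command line (the tool names clean_data, select_points, run_analysis and plot_results are purely hypothetical stand-ins for whatever small programs you have to hand):

# Each stage is a small, single-purpose tool; the names are illustrative only.
cat data.csv |
  clean_data --drop-missing |          # remove poor data
  select_points --min-sio2 45 |        # select the data points of interest
  run_analysis --method regression |   # run the analyses
  plot_results > figure.png            # create the visualisation

Changing the selection criteria then means editing the arguments to select_points, and nothing else in the chain needs to know about it.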
This also allows people to focus on solving particular sub-problems: the way data points are selected, for instance, can be changed and improved, or even replaced with a different kind of selection entirely, without touching the rest of the chain. You can also drop a new step into the toolchain with relative ease. Maybe you need to apply a fudge factor to all your readings as a preprocessing step, but only after you have selected your data points. All of this can be done without making anything else work differently.
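Sticking with standard Unix tools, that fudge-factor step might be a single awk command dropped into the middle of the pipeline, reusing the hypothetical tools from the sketch above (the column number and the factor of 1.02 are made up for the sake of the example):

# Multiply the third field of each data row by a correction factor of 1.02,
# leaving the header line (NR == 1) untouched.
select_points --min-sio2 45 < data.csv |
  awk -F, -v OFS=, 'NR == 1 { print; next } { $3 = $3 * 1.02; print }' |
  run_analysis --method regression

Nothing upstream or downstream of the awk step needs to change; it simply reads and writes the same CSV stream as everything else.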
An obvious requirement for all of this to work is that all your tools need to read and write standard formats. These do exist for various datatypes: image formats such as .png or .svg, and various non-binary formats such as .csv or .xml files. If these are written with sensible, well-understood headers, there is no reason we cannot start to build small, self-contained, useful tools that can be chained together.
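As a small illustration, a plain .csv file with a sensible header row (the column names below are invented) can already be picked apart by the standard tools, which is all a small program further down the chain needs:

# data.csv -- the first line is a header naming each column
#   sample,sio2,mgo
#   S-001,47.2,8.1
#   S-002,52.6,6.4

# Pull out the second column (sio2) for every sample:
cut -d, -f2 data.csv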