This section lists the tools used at RSES for handling and processing data. For more detail on the tools developed at RSES, we also recommend a tour of the iEarth website.
Several strategies are used for handling data and storing it for the long term, a critical concern now that the amount of data is growing exponentially. Text files are often a good place to start, but several other formats exist that make it easier to handle the data and to store meta-information, which is critical for long-term storage and for keeping the data useful in the future.
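As a minimal sketch of keeping meta-information together with the data in a plain text file, the example below writes a small JSON document and reads it back using only the Python standard library. The field names (metadata, sample, units, measurements) and all values are illustrative assumptions, not an RSES convention.

```python
import json
import os
import tempfile

# Hypothetical record: the field names and values are illustrative only.
record = {
    "metadata": {
        "sample": "example-01",
        "units": "ppm",
        "instrument": "not specified",
    },
    "measurements": [12.1, 11.8, 12.4],
}

# Write the data and its metadata together in one human-readable text file.
path = os.path.join(tempfile.mkdtemp(), "record.json")
with open(path, "w", encoding="utf-8") as handle:
    json.dump(record, handle, indent=2)

# Reading the file back recovers both the values and their description.
with open(path, encoding="utf-8") as handle:
    loaded = json.load(handle)

print(loaded["metadata"]["units"], loaded["measurements"])
```

Because the units and sample name travel with the numbers, the file stays interpretable even when separated from the notes that produced it.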
Spreadsheets allow data to be organised and stored in tabular form, with limited data analysis. For many people, the first spreadsheet software that comes to mind is Excel. LibreOffice provides a great free and open-source alternative. Numerically focused programming languages can interact with spreadsheets via built-in or additional libraries. Google Sheets is also worth a look, particularly for its fantastic collaboration features.
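Spreadsheet programs can export and import CSV text files, which is one simple way for a programming language to exchange tabular data with them. The sketch below uses Python's built-in csv module; the column names and values are made up for illustration, and reading native spreadsheet files would need an additional library.

```python
import csv
import io

# Hypothetical table, as it might be exported from a spreadsheet to CSV.
csv_text = """sample,depth_m,temperature_c
A,10,4.5
B,20,4.1
C,30,3.9
"""

# DictReader maps each row to the column names found in the header line.
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Values are read as strings, so convert them before doing arithmetic.
temperatures = [float(row["temperature_c"]) for row in rows]
mean_temperature = sum(temperatures) / len(temperatures)

print(f"{len(rows)} rows, mean temperature {mean_temperature:.2f}")
```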
SQL is a domain-specific language for storing and managing data in a relational database. You can run queries to add, delete, update or look up specific data. It is very powerful for data organised as in a spreadsheet when queries (missing in Excel and LibreOffice) are needed. For a free version, SQLite is a must (particularly with its Firefox manager), and it can easily interact with the free languages R, Python or Julia.
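The add, update and look-up queries described above can be sketched with the sqlite3 module that ships with Python. The table name, columns and values below are hypothetical and chosen only for illustration.

```python
import sqlite3

# An in-memory database keeps the example self-contained; pass a file
# name instead to keep the data on disk.
connection = sqlite3.connect(":memory:")

# Hypothetical table and values, for illustration only.
connection.execute(
    "CREATE TABLE samples (name TEXT, rock_type TEXT, sio2_wt REAL)"
)
connection.executemany(
    "INSERT INTO samples VALUES (?, ?, ?)",
    [
        ("S1", "basalt", 49.5),
        ("S2", "rhyolite", 73.2),
        ("S3", "basalt", 50.8),
    ],
)

# Update one entry, then query only the rows matching a condition.
connection.execute("UPDATE samples SET sio2_wt = 50.9 WHERE name = 'S3'")
basalts = connection.execute(
    "SELECT name, sio2_wt FROM samples WHERE rock_type = ? ORDER BY name",
    ("basalt",),
).fetchall()

print(basalts)
connection.close()
```

Using the ? placeholders rather than building the query string by hand lets the library handle quoting of the values.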
Hierarchical Data Format (HDF5) is a format designed to store and organise large amounts of data. The usual numerically focused programming languages offer plenty of options for saving and loading HDF5 data.
Four high-level programming languages focused on numerical analysis are currently in use at RSES: Matlab, Python, R, and Julia.
Python is a general-purpose high-level programming language, offering excellent numerical capabilities through its scientific libraries, the big three being SciPy, NumPy and Matplotlib. If you're new, a good place to start is the Anaconda installer, which makes installation and maintenance relatively painless. Python is sometimes considered slower than other (compiled) languages, but its syntax is very elegant and well suited to first-time programmers.
R is an open-source implementation of the S language, which was first developed at Bell Labs in the 1970s and is aimed at statistical computing. Many different libraries are available through the CRAN repository. Like Python, R may not be the fastest language for heavy numerical computations, but it benefits from many data analysis and statistics tools.
Julia is a relatively new but fast-growing language. It is a high-level language aimed at numerical computation: as simple as Matlab, as fast as C or Fortran, as expressive as Python, and with a central package repository like R. Because Julia is quite new, not everything is available in its libraries yet, but it is very pleasant to use. One advantage of Julia is that Python libraries can be called directly from Julia code, as can C or Fortran functions (a no-wrapper policy is in place). Julia thus aims to solve the famous "two-language" problem.
Matlab is available to any member of the ANU; please consult the page http://matlab.anu.edu.au/ for information. Matlab is the go-to language in many domains, but it is also very expensive, and its data analytics and machine learning libraries may be a bit outdated compared with the rapidly evolving open-source libraries in other languages, which are directly supported by data scientists. Free alternatives are Scilab and Octave.
Low-level programming languages
Fortran and C are also used at RSES for specific applications, notably in Geophysics. Those languages can offer very high computational speed for specific applications. However, code prototyping and routine data analysis are not easy with such languages. For such tasks, high-level languages with a numerical analysis focus (as listed above) are used and recommended.
Having a good text editor can be the difference between repeatedly banging your head against a wall, and skipping happily through a flowering meadow. A good editor will make programming as straightforward as possible by providing things like auto-completion, syntax highlighting and checking, debugging and version-control integration. There are lots of options out there; these are a few of our favourites. All are cross-platform and available on all modern operating systems.
Fast, easy to set up, multiple language support, 'intelligent' auto-completion, customisable. Out-of-the-box Git integration and debugging. Huge 'extensions' library to add extra functionality. Made by Microsoft (?!). Ease of use makes this an excellent place to start if you're new to coding.
GitHub's offering. Similar functionality to VSCode: slightly more customisable, but no built-in debugging tools. Lots of extensions are available to add these functions, though. Con: slower than VSCode.
The predecessor to Atom and VSCode. Costs money ($70). Still an excellent editor, but really no reason to pay for it when the others are equally (if not more) capable, and free.
One of the 'original' text editors. You probably already have it installed, but don't know about it. Try typing vim in your terminal, and you'll probably find it there. Very capable, but initially impenetrable.
The 'real geek's' editor. You can control pretty much everything on your computer from Emacs. The STEEP learning curve can be off-putting.
The best of Vim and Emacs, combined.
"Makes Emacs useable!" (Branson, 2018).
* If you're just starting out, pick one of these three. They're functionally very similar, hugely capable and excellent. For complete beginners Visual Studio Code is probably the best choice, as they've put a lot of effort into making the set-up and customisation process as painless as possible, and the extensions framework is more intuitive.
Local virtual machines
Virtual machines let you run one operating system inside another, for instance Linux on a Windows or Mac system. This makes it possible, for example, to use Python or Julia under Linux and benefit from libraries that are not "Windows ready". Several options are available, both commercial and free open-source. To get started, we recommend the free VirtualBox from Oracle.
Containers are lightweight virtual environments designed to make it easy to distribute and run a piece of software on another machine, largely regardless of how that machine is set up. This approach can be particularly successful at providing ready-to-go environments for new users. More information can be found on the website of the well-known container provider Docker.
At RSES we use, and also develop, specific libraries for processing geoscience data in various domains, as listed below.
|SciPy||Scientific Python. A diverse range of data analysis and manipulation tools, from statistical tests to data fitting and spectral analysis.||Python|
|Pandas||A mature 'data frame' library for organising and analysing data. Think spreadsheets but for Python, and with much more advanced data processing capabilities.||Python|
|NumPy||Numeric Python. The workhorse for array-based numeric calculations in Python.||Python|
|SymPy||Symbolic algebra! Write, manipulate and solve equations.||Python|
|Matplotlib||Publication-ready plots, with a few lines of code.||Python|
|Jupyter||A browser-based interface to Python (and other languages), which allows in-line coding, note-taking and plotting.||Python|
Optimisation - Traditional
|SciPy Optimise||The SciPy Python library contains various optimisation algorithms for solving linear and non-linear problems.||Python|
|JuliaOpt libraries||The JuliaOpt GitHub organization is home to a number of optimization-related packages written in Julia.||Julia|
|Matlab Optimisation Toolbox (non free)||The Matlab optimisation toolbox contains various algorithms for performing optimisation and model fitting.||Matlab|
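To give a feel for what the optimisation routines in such libraries do, here is a minimal pure-Python sketch of golden-section search, a classic method for minimising a function of one variable. It is not the implementation or API of SciPy, JuliaOpt or the Matlab toolbox; the function names and the quadratic misfit are hypothetical and chosen only for illustration.

```python
import math


def golden_section_minimise(func, lower, upper, tolerance=1e-6):
    """Minimise a unimodal function of one variable on [lower, upper]."""
    inv_phi = (math.sqrt(5) - 1) / 2  # inverse golden ratio, about 0.618
    a, b = lower, upper
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    # Shrink the bracket [a, b] until it is narrower than the tolerance.
    while abs(b - a) > tolerance:
        if func(c) < func(d):
            b = d  # the minimum lies in [a, d]
        else:
            a = c  # the minimum lies in [c, b]
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
    return (a + b) / 2


def misfit(x):
    # Hypothetical misfit function with its minimum at x = 2.
    return (x - 2.0) ** 2 + 1.0


best_x = golden_section_minimise(misfit, 0.0, 5.0)
print(f"minimum near x = {best_x:.4f}")
```

In practice the library routines listed above are the better choice, since they handle several variables, constraints and convergence diagnostics.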
Optimisation - Probabilistic
|RJ-MCMC||Reversible-Jump Markov Chain Monte Carlo library developed at the Research School of Earth Sciences. See the Jupyter Notebook example here!||Python|
|emcee - The MCMC Hammer||Goodman & Weare’s Affine Invariant Markov chain Monte Carlo (MCMC) Ensemble sampler.||Python|
|PyMC3||Markov-Chain Monte-Carlo calculations in Python.||Python|
|Mamba||An open platform for the implementation and application of MCMC methods to perform Bayesian analysis.||Julia|
|Stan||Platform for statistical modelling and high-performance statistical computation.||Various|
|Scikit-Learn||A very mature library for data analysis using machine-learning techniques. Used in Spectra.jl (see below), for instance.||Python|
|Orange||Orange is an open source visual programming software for data mining and visualization.||Python/Standalone|
|Shogun||Shogun is an open source C++ library for machine learning.||Various|
|Dlib C++||Dlib C++ is an open source C++ library for machine learning.||Python/C++|
|mlpack||mlpack is an open source C++ library for machine learning.||C++|
|TensorFlow||TensorFlow is an open source software library developed by Google for machine learning.||Python/Various|
|Theano||Theano is an open source Python library developed by the Lisa Lab in Montreal for machine learning, and particularly aimed at deep learning.||Python|
|Torch||Torch is a scientific computing framework with wide support for machine learning algorithms that puts GPUs first. See also PyTorch, for a Python interface to Torch.||LuaJIT|
|Keras||A high-level neural networks library that focuses on easy implementation of deep neural networks. Works on top of both Theano and TensorFlow.||Python|
|Lasagne||Lasagne is a lightweight library to build and train neural networks in Theano.||Python|
|Caffe||Deep Learning framework.||C++|
|Mocha||Deep Learning framework, inspired by Caffe.||Julia|
|Spectra.jl||Library for spectroscopic (Raman, Infrared, Nuclear Magnetic Resonance, XAS...) data treatment and analysis.||Julia|
|RamPy||Library for spectroscopic (Raman, Infrared, Nuclear Magnetic Resonance, XAS...) data treatment and analysis.||Python|
|gcvspline||A small package wrapping the gcvspl.f FORTRAN library.||Python|
|ViscoAG||Software for calculating the viscosity of silicate melts based on the knowledge of their structure.||Julia|
|LAtools||Tools for processing Laser Ablation Mass Spectrometry data.||Python|