Research tools

This section lists the tools used at RSES for handling and processing data. For more details on the tools developed at RSES, we also recommend taking a tour of the iEarth website.

Data confinement

Several strategies are used for handling and storing data over the long term, a critical concern now that the amount of data is growing exponentially. Text files are often a good place to start, but several other formats exist that can help with handling the data and with storing meta-information (metadata), which is critical for long-term storage and for keeping the data useful in the future.

Spreadsheet

Spreadsheets allow data to be organised and stored in tabular form, and support limited data analysis. For many people, the first spreadsheet software that comes to mind is Excel. LibreOffice provides a great free and open-source alternative. Numerically focused programming languages can interact with spreadsheets via built-in or additional libraries. Google Sheets is also worth a look, particularly for its fantastic collaboration features.
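As a minimal sketch of how a programming language can read spreadsheet data, the example below parses a table exported as CSV using only Python's built-in csv module. The column names and values are made up for illustration; a real workflow might instead use a library such as Pandas.

```python
import csv
import io
import statistics

# Hypothetical measurements, as they might be exported from a spreadsheet to CSV.
csv_text = """sample,depth_m,temperature_c
A1,10,14.2
A2,20,12.8
A3,30,11.5
"""


def read_rows(text):
    """Parse CSV text into a list of dicts, converting numeric columns to float."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        rows.append({
            "sample": row["sample"],
            "depth_m": float(row["depth_m"]),
            "temperature_c": float(row["temperature_c"]),
        })
    return rows


rows = read_rows(csv_text)
mean_temperature = statistics.mean(r["temperature_c"] for r in rows)
print(f"{len(rows)} samples, mean temperature {mean_temperature:.2f} C")
```

To read a file on disk instead, pass an open file object to csv.DictReader in place of the io.StringIO wrapper.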

SQL

SQL is a domain-specific language for storing and managing data in a relational database. You can run queries to add, delete, update and look up specific data. It is very powerful for data organised like spreadsheets where queries (missing in Excel and LibreOffice) are needed. For a free version, SQLite is a must (particularly with its Firefox manager), and it can easily interact with the free languages R, Python or Julia.
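The four kinds of query mentioned above (add, delete, update, look up) can be sketched with Python's built-in sqlite3 module. The table layout and values here are hypothetical and only serve as an illustration.

```python
import sqlite3

# In-memory database, so the example leaves no file behind.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE samples (name TEXT PRIMARY KEY, rock TEXT, sio2_wt REAL)")

# Add rows (the values are made up for illustration).
conn.executemany(
    "INSERT INTO samples VALUES (?, ?, ?)",
    [("S1", "basalt", 49.5), ("S2", "rhyolite", 73.1), ("S3", "andesite", 58.7)],
)

# Update one row and delete another.
conn.execute("UPDATE samples SET sio2_wt = ? WHERE name = ?", (50.2, "S1"))
conn.execute("DELETE FROM samples WHERE name = ?", ("S3",))

# Look up specific data: only the silica-rich samples.
silica_rich = conn.execute(
    "SELECT name, sio2_wt FROM samples WHERE sio2_wt > ? ORDER BY name", (60.0,)
).fetchall()
remaining = conn.execute("SELECT COUNT(*) FROM samples").fetchone()[0]
print(silica_rich, remaining)
conn.close()
```

The "?" placeholders let sqlite3 insert the values safely, which is generally preferable to building query strings by hand.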

HDF5

Hierarchical Data Format (HDF5) is a format designed to store and organise large amounts of data. The usual numerically focused programming languages offer plenty of options for saving and loading HDF5 data.
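As a rough sketch of saving and loading HDF5 data from Python, the example below uses the h5py library together with NumPy (both assumed to be installed; h5py is one option among several). The group names, dataset names and attribute values are hypothetical.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical spectrum: 100 wavenumber/intensity pairs.
wavenumber = np.linspace(100.0, 1500.0, 100)
intensity = np.exp(-((wavenumber - 800.0) / 50.0) ** 2)

path = os.path.join(tempfile.mkdtemp(), "example.h5")

# Write: datasets live in groups (like folders) and can carry attributes (metadata).
with h5py.File(path, "w") as f:
    group = f.create_group("raman/sample_01")
    group.create_dataset("wavenumber", data=wavenumber)
    dset = group.create_dataset("intensity", data=intensity)
    dset.attrs["units"] = "arbitrary"

# Read back the data and its metadata.
with h5py.File(path, "r") as f:
    loaded = f["raman/sample_01/intensity"][:]
    units = f["raman/sample_01/intensity"].attrs["units"]

print(loaded.shape, units)
```

Storing units and similar information as attributes next to the data is one way to keep the meta-information with the measurements for long-term storage.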

Programming languages

Four high-level programming languages focused on numerical analysis are currently in use at RSES: Matlab, Python, R, and Julia.

Python

Python is a general-purpose high-level programming language, offering excellent numerical capabilities through its scientific libraries, the big three being SciPy, NumPy and Matplotlib. If you're new, a good place to start is the Anaconda installer, which makes installation and maintenance relatively painless. Python is sometimes considered slower than compiled languages, but its syntax is very elegant and well suited to first-time programmers.

R

R is an open-source implementation of the S language, which was developed at Bell Labs from the 1970s and is aimed at statistical computing. Many different libraries are available through the CRAN repositories. Like Python, R may not be the fastest language for heavy numerical computations, but it benefits from many data analysis and statistics tools.

Julia

Julia is a relatively new but fast-growing language. It is a high-level language aimed at numerical computations: as simple as Matlab, as fast as C or Fortran, as expressive as Python, and with a central repository system like R. Because Julia is quite new, not everything is available in its libraries yet, but it is very pleasant to use. One advantage of Julia is that Python libraries can be called directly from Julia code, as can C or Fortran functions (a no-wrapper policy is in place). Julia thus addresses the famous "two-language" problem.

Matlab

Matlab is available to any member of the ANU; please consult the page http://matlab.anu.edu.au/ for information. Matlab is the go-to language in many domains, but it is also very expensive, and its data analytics and machine learning libraries may be a bit outdated compared to the rapidly evolving open-source libraries in other languages, which are directly supported by data scientists. Free alternatives are Scilab and Octave.

Low-level programming languages

Fortran and C are also used at RSES for specific applications, notably in Geophysics. Those languages can offer very high computational speed for specific applications. However, code prototyping and routine data analysis are not easy with such languages. For such tasks, high-level languages with a numerical analysis focus (as listed above) are used and recommended.

Text Editors

Having a good text editor can be the difference between repeatedly banging your head against a wall, and skipping happily through a flowering meadow. A good editor will make programming as straightforward as possible by providing things like auto-completion, syntax highlighting and checking, debugging and version-control integration. There are lots of options out there. These are a few of our favourites. All are cross-platform and available on all modern operating systems.

Visual Studio Code*

Fast, easy to set up, multiple language support, 'intelligent' auto-completion, customisable. Out-of-the-box Git integration and debugging. Huge 'extensions' library to add extra functionality. Made by Microsoft (?!). Ease of use makes this an excellent place to start if you're new to coding.

Atom*

GitHub's offering. Similar functionality to Visual Studio Code: slightly more customisable, but no built-in debugging tools. Lots of extensions are available to add these functions, though. Con: slower than Visual Studio Code.

Sublime Text*

The predecessor to Atom and Visual Studio Code. Costs money ($70). Still an excellent editor, but there is really no reason to pay for it when the others are equally (if not more) capable, and free.

Vim

One of the 'original' text editors. You probably already have it installed, but don't know about it. Try typing vim in your terminal, and you'll probably find it there. Very capable, but initially impenetrable.

Emacs

The 'real geeks' editor. You can control pretty much everything on your computer from Emacs. The STEEP learning curve can be off-putting.

Spacemacs

The best of Vim and Emacs, combined. "Makes Emacs useable!" (Branson, 2018).

* If you're just starting out, pick one of these three. They're functionally very similar, hugely capable and excellent. For complete beginners Visual Studio Code is probably the best choice, as they've put a lot of effort into making the set-up and customisation process as painless as possible, and the extensions framework is more intuitive.

Virtual environments

Local virtual machines

Virtual machines emulate a complete computer system, making it possible, for instance, to run Linux on a Windows or Mac system. This lets you, for example, use Python or Julia under Linux to take advantage of libraries that are not "Windows ready". Several options are available, both commercial and free and open source. To get started, we recommend the free VirtualBox from Oracle.

Containers

Containers are lightweight virtual environments designed to make it easy to distribute and run a piece of software on any machine, largely regardless of how that machine is set up. Such an approach can be particularly successful at providing ready-to-go environments for new users. More information can be found on the website of the well-known container provider Docker.

Libraries

We use and also develop specific libraries at RSES, aimed at the processing of geoscience data in various domains, as listed below.

General Libraries

SciPy (Python): Scientific Python. A diverse range of data analysis and manipulation tools, from statistical tests to data fitting and spectral analysis.
Pandas (Python): A mature 'data frame' library for organising and analysing data. Think spreadsheets but for Python, and with much more advanced data processing capabilities.
NumPy (Python): Numeric Python. The workhorse for array-based numeric calculations in Python.
SymPy (Python): Symbolic algebra! Write, manipulate and solve equations.
Matplotlib (Python): Publication-ready plots, with a few lines of code.
Jupyter (Python): A browser-based interface to Python (and other languages), which allows in-line coding, note-taking and plotting.

Optimisation - Traditional

SciPy Optimise (Python): The SciPy Python library contains various optimisation algorithms for solving linear and non-linear problems.
JuliaOpt libraries (Julia): The JuliaOpt GitHub organization is home to a number of optimization-related packages written in Julia.
Matlab Optimisation Toolbox (Matlab, non-free): The Matlab optimisation toolbox contains various algorithms for performing optimisation and model fitting.
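As an illustration of model fitting with SciPy's optimisation tools, the sketch below fits an exponential decay to synthetic data using scipy.optimize.curve_fit. The model, the "true" parameters and the noise level are all assumptions chosen for the example.

```python
import numpy as np
from scipy.optimize import curve_fit


def model(x, a, b):
    """Exponential decay a * exp(-b * x)."""
    return a * np.exp(-b * x)


# Synthetic data generated from known parameters plus a little noise.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 5.0, 50)
true_a, true_b = 2.5, 1.3
y = model(x, true_a, true_b) + rng.normal(scale=0.01, size=x.size)

# Non-linear least-squares fit; p0 is the initial guess for (a, b).
params, covariance = curve_fit(model, x, y, p0=(1.0, 1.0))
errors = np.sqrt(np.diag(covariance))  # one-sigma uncertainties on a and b
print("fitted a, b:", params, "+/-", errors)
```

With real data, the quality of the initial guess p0 can matter a great deal for non-linear problems, so it is worth plotting the model against the data before trusting the fit.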

Optimisation - Probabilistic

RJ-MCMC (Python): Reversible-Jump Markov Chain Monte Carlo library developed at the Research School of Earth Sciences. See the Jupyter Notebook example here!
emcee, the MCMC Hammer (Python): Goodman & Weare’s affine-invariant Markov chain Monte Carlo (MCMC) ensemble sampler.
PyMC3 (Python): Markov chain Monte Carlo calculations in Python.
Mamba (Julia): An open platform for the implementation and application of MCMC methods to perform Bayesian analysis.
Stan (Various): Platform for statistical modelling and high-performance statistical computation.
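To give a feel for what these MCMC libraries do, here is a minimal random-walk Metropolis sampler written with the Python standard library only. It is a teaching sketch of the basic Metropolis algorithm, not the reversible-jump or affine-invariant ensemble methods implemented by the libraries above, and the Gaussian target (mean 3, standard deviation 0.5) is a hypothetical choice.

```python
import math
import random
import statistics


def log_target(x, mu=3.0, sigma=0.5):
    """Log of an unnormalised Gaussian density (the hypothetical target)."""
    return -0.5 * ((x - mu) / sigma) ** 2


def metropolis(log_prob, start, n_steps, step_size, seed=0):
    """Random-walk Metropolis sampler for a one-dimensional target."""
    rng = random.Random(seed)
    x = start
    logp = log_prob(x)
    chain = []
    for _ in range(n_steps):
        proposal = x + rng.gauss(0.0, step_size)
        logp_new = log_prob(proposal)
        # Accept with probability min(1, p_new / p_old).
        if rng.random() < math.exp(min(0.0, logp_new - logp)):
            x, logp = proposal, logp_new
        chain.append(x)  # a rejected proposal repeats the current state
    return chain


chain = metropolis(log_target, start=0.0, n_steps=20000, step_size=1.0)
samples = chain[2000:]  # discard the first steps as burn-in
sample_mean = statistics.fmean(samples)
sample_std = statistics.stdev(samples)
print(f"mean ~ {sample_mean:.2f}, std ~ {sample_std:.2f}")
```

The sample mean and standard deviation should land close to the target's 3 and 0.5. For real problems, the dedicated libraries handle tuning, multiple dimensions and convergence diagnostics far better than a hand-written loop.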

Machine Learning

Scikit-Learn (Python): A very mature library for data analysis using machine-learning techniques. Used in Spectra.jl (see below), for instance.
Orange (Python/Standalone): Orange is an open source visual programming software for data mining and visualization.
Shogun (Various): Shogun is an open source C++ library for machine learning.
Dlib C++ (Python/C++): Dlib C++ is an open source C++ library for machine learning.
mlpack (C++): mlpack is an open source C++ library for machine learning.
TensorFlow (Python/Various): TensorFlow is an open source software library developed by Google for machine learning.
Theano (Python): Theano is an open source Python library developed by the Lisa Lab in Montreal for machine learning, and particularly aimed at deep learning.
Torch (LuaJIT): Torch is a scientific computing framework with wide support for machine learning algorithms that puts GPUs first. See also PyTorch, for a Python interface to Torch.

Deep Learning

Keras (Python): A high-level neural networks library that focuses on easy implementation of deep neural networks. Works on top of both Theano and TensorFlow.
Lasagne (Python): Lasagne is a lightweight library to build and train neural networks in Theano.
Caffe (C++): Deep Learning framework.
Mocha (Julia): Deep Learning framework, inspired by Caffe.

Spectroscopy

Spectra.jl (Julia): Library for spectroscopic (Raman, Infrared, Nuclear Magnetic Resonance, XAS...) data treatment and analysis.
RamPy (Python): Library for spectroscopic (Raman, Infrared, Nuclear Magnetic Resonance, XAS...) data treatment and analysis.
gcvspline (Python): A small package wrapping the gcvspl.f FORTRAN library.

Geochemistry

ViscoAG (Julia): Software for calculating the viscosity of silicate melts based on the knowledge of their structure.
LAtools (Python): Tools for processing Laser Ablation Mass Spectrometry data.