Research tools

This section lists the tools used at RSES for handling and processing data.

Data storage

Several strategies are used for handling and storing data over the long term, a critical concern now that the amount of data is growing exponentially. Text files are often a good place to start, but several other formats exist and may be useful for handling the data and storing meta-information, which is critical for long-term storage and for ensuring that the data remain useful in the future.
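As an illustration of keeping meta-information next to a text file, here is a minimal Python sketch that writes a small table to CSV and its description to a JSON "sidecar" file. The function names, file names, column names and metadata fields are all hypothetical choices made for this example, not an RSES convention.

```python
import csv
import json
import tempfile
from pathlib import Path


def save_with_metadata(folder, name, header, rows, metadata):
    """Write rows to <name>.csv and metadata to <name>.json inside folder."""
    folder = Path(folder)
    data_path = folder / f"{name}.csv"
    meta_path = folder / f"{name}.json"
    with data_path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)
    with meta_path.open("w", encoding="utf-8") as f:
        json.dump(metadata, f, indent=2)
    return data_path, meta_path


def load_metadata(meta_path):
    """Read back the JSON sidecar file as a dictionary."""
    with Path(meta_path).open(encoding="utf-8") as f:
        return json.load(f)


# Hypothetical measurements and metadata, for illustration only.
with tempfile.TemporaryDirectory() as tmp:
    data_path, meta_path = save_with_metadata(
        tmp,
        "sample_001",
        ["temperature_K", "viscosity_Pa_s"],
        [(1500, 12.5), (1600, 4.2)],
        {
            "sample": "sample_001",
            "units": {"temperature_K": "kelvin", "viscosity_Pa_s": "pascal second"},
            "comment": "hypothetical values",
        },
    )
    print(load_metadata(meta_path)["units"])
```

Keeping the description in a separate, human-readable file means the data file stays a plain table that any tool can open, while units and provenance are not lost.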

Spreadsheet

Spreadsheet software allows analysing, organising and storing data in tabular form. For many people, the first spreadsheet program that comes to mind is [Excel](https://products.office.com/en/excel). A great free and open-source alternative is [LibreOffice](https://www.libreoffice.org/download/download/). Numerically focused programming languages can interact with spreadsheets via built-in or additional libraries.
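As a small sketch of that interaction, the Python example below reads a sheet that has been exported as CSV and averages one column, using only the built-in `csv` module. The sheet content, the column names and the `column_mean` helper are hypothetical; native `.xlsx` or `.ods` files would need an additional library.

```python
import csv
import io
import statistics

# Hypothetical content of a sheet exported as CSV from Excel or LibreOffice.
EXPORTED_SHEET = """sample,SiO2_wt,Na2O_wt
A1,72.1,13.9
A2,68.4,15.2
A3,75.0,11.8
"""


def column_mean(csv_text, column):
    """Return the mean of a numeric column in CSV text that has a header row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    values = [float(row[column]) for row in reader]
    return statistics.mean(values)


mean_sio2 = column_mean(EXPORTED_SHEET, "SiO2_wt")
print(f"Mean SiO2: {mean_sio2:.2f} wt%")
```

For a file on disk, the same reader can be fed from `open(path, newline="")` instead of `io.StringIO`.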

SQL

SQL is a domain-specific language for storing and managing data in a relational database. Queries let you add, delete, update and look up specific data. It is very powerful for data organised as tables where queries (missing in Excel and LibreOffice) are needed. For a free implementation, [SQLite](https://www.sqlite.org/) is a must (particularly with its [Firefox manager](https://addons.mozilla.org/fr/firefox/addon/sqlite-manager/)), and it can easily interact with the free languages R, Python or Julia.
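The four kinds of query mentioned above (add, update, delete, look up) can be sketched with Python's built-in `sqlite3` module. The table layout, the sample names and the `build_demo_db` helper are hypothetical and only serve the illustration; the database lives in memory, so nothing is written to disk.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory SQLite database holding a small, made-up table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE samples (name TEXT PRIMARY KEY, rock_type TEXT, sio2_wt REAL)"
    )
    # Add rows; the ? placeholders let sqlite3 handle quoting safely.
    conn.executemany(
        "INSERT INTO samples VALUES (?, ?, ?)",
        [
            ("G1", "granite", 72.3),
            ("G2", "granite", 70.1),
            ("B1", "basalt", 48.9),
            ("R1", "rhyolite", 74.8),
        ],
    )
    conn.commit()
    return conn


conn = build_demo_db()

# Update one value.
conn.execute("UPDATE samples SET sio2_wt = ? WHERE name = ?", (49.5, "B1"))

# Delete one row.
conn.execute("DELETE FROM samples WHERE name = ?", ("G2",))

# Look up the silica-rich samples that remain.
rows = conn.execute(
    "SELECT name, sio2_wt FROM samples WHERE sio2_wt > ? ORDER BY name", (60.0,)
).fetchall()
print(rows)

conn.close()
```

Replacing `":memory:"` with a file name stores the same database in a single file that R or Julia can open as well.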

HDF5

HDF5 (Hierarchical Data Format version 5) is a format designed to store and organise large amounts of data. The usual numerically focused programming languages offer plenty of options for saving and loading HDF5 data.

Programming languages

Four high-level programming languages focused on numerical analysis are currently in use at RSES: Matlab, Python, R, and Julia.

Matlab

Matlab is available to any member of the ANU; please consult http://matlab.anu.edu.au/ for information. Matlab is the go-to language in many domains, but it is also very expensive, and its data analytics and machine learning libraries may be somewhat outdated compared to the rapidly evolving open-source libraries in other languages, which are directly supported by data scientists. Free alternatives are Scilab and Octave.

Python

Python is a general-purpose high-level programming language, offering excellent numerical capabilities through its scientific libraries, the big three being SciPy, NumPy and Matplotlib. We recommend installing it with the Anaconda installer, which makes installation and maintenance very easy. Python is sometimes considered slower than other languages (this is not always true), but its syntax is very elegant and well suited to first-time programmers.

Julia

Julia is a relatively new but growing language. It is a high-level language aimed at numerical computation: as simple as Matlab, as fast as C or Fortran, as expressive as Python, and with a central package repository like R's. Because Julia is quite new, not everything is available in its libraries yet, but it is a fast-growing language that is very pleasant to use. An advantage of Julia is that Python libraries can be called directly from Julia code, as can C or Fortran functions (a no-wrapper policy is in place). Julia thus solves the famous "two-language" problem.

R

R is an open-source implementation of the S language, which was developed at Bell Labs starting in the 1970s and is aimed at statistical computing. Many different libraries are available through the CRAN repositories. Like Python, R may not be the fastest language for heavy numerical computations, but it benefits from many data-analysis and statistical tools.

Low-level programming languages

Fortran and C are also used at RSES for specific applications, notably in geophysics. These languages can offer very high computational speed for specific applications. However, code prototyping and routine data analysis are not easy with such languages. For these tasks, high-level languages with a numerical analysis focus (as listed above) are used and recommended.

Virtual environments

Local virtual machines

Virtual machines emulate a complete computer system, making it possible, for instance, to run Linux on a Windows or Mac system. This allows you, for example, to use Python or Julia under Linux to take advantage of libraries that are not "Windows ready". Several options are available, both commercial and free open source. To get started, we recommend the free VirtualBox from Oracle.

Containers

Containers are lightweight virtual environments designed to make it easy to distribute and run a piece of software on almost any machine, regardless of how that machine is set up. Such an approach can be particularly successful at providing ready-to-go environments for new users. More information can be found on the website of the well-known container provider Docker.

Libraries

We use and also develop specific libraries at RSES, aimed at the processing of geoscience data in various domains, as listed below.

General libraries

Scikit-learn

Scikit-learn is a Python library offering many different algorithms for data analysis and machine learning. It is used in Spectra.jl (see below), for instance.

Orange

Orange is an open source visual programming software for data mining and visualization.

Shogun

Shogun is an open source C++ library for machine learning.

Dlib C++

Dlib C++ is an open source C++ library for machine learning.

mlpack

mlpack is an open source C++ library for machine learning.

TensorFlow

TensorFlow is an open source software library developed by Google for machine learning.

Theano

Theano is an open source Python library developed by the LISA lab in Montreal for machine learning, and particularly aimed at deep learning.

Torch

Torch is a scientific computing framework with wide support for machine learning algorithms that puts GPUs first.

Deep Learning

Keras

Keras is a deep-learning library for Theano and TensorFlow, two open source libraries for numerical computations. Keras is focused on easy implementation of deep neural networks.

Lasagne

Lasagne is a lightweight library to build and train neural networks in Theano.

Mocha

Mocha is a Deep Learning framework for Julia, inspired by the C++ framework Caffe.

Caffe

Caffe is a Deep Learning framework in C++.

Spectroscopy

Spectra.jl

Spectra.jl is a library written in Julia that helps with the treatment of spectroscopic (Raman, infrared, nuclear magnetic resonance, XAS...) data.

gcvspline

gcvspline is a small Python package wrapping the gcvspl.f FORTRAN library.

Geochemistry

ViscoAG

ViscoAG is a program written in Julia that calculates the viscosity of silicate melts from knowledge of their structure.

latools

latools is a set of Python tools for processing laser ablation mass spectrometry data.

Data science research at RSES

Under Construction