Arch Linux setup guide tailored towards data science, R and spatial analysis

This guide does not claim to be complete. It reflects my view on how to set up a working Arch Linux system tailored towards data science, R and spatial analysis. If you have suggestions for modifications, please open an issue. Enjoy the power of Linux!

Setting up Arch Linux the old-school way (i.e. configuring everything yourself) is quite tedious. To be honest, I’ve never done it myself. I’ve always used distributions like Antergos or Manjaro. On the one hand these function as an installation wrapper for Arch, and on the other hand they provide their own repositories. Both come with all kinds of desktop environments to choose from (GNOME, KDE, XFCE, etc.). I will not go into details of desktop environments in this post. I will outline my personal setup in a separate post in the future.

Make sure to check out the ArchWiki pages “FAQ” and “Arch compared to other distributions” to get a better understanding of Arch.

1. Installation

1.1 Setting up the partitions

Several valid concepts exist on how to partition a Linux system. The following reflects my current view:

  1. Select “Manual” partitioning when prompted
  2. Create a partition of 1 GB. Mount point: /boot/efi. Format: fat32
  3. Create a swap partition which is slightly larger than your RAM size (e.g. for 16 GB RAM use a 16.5 GB partition). Format: Linux Swap
  4. Create a 50 - 100 GB partition for “root”. Mount point: /. Format: ext4
  5. With the remaining space create “home”. Mount point: /home. Format: ext4
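The swap rule of thumb from step 3 can be sketched as a small shell calculation. This is only an illustration: suggest_swap_mib is a hypothetical helper name, and the 512 MiB margin is simply the "0.5 GB extra" from the example above expressed in MiB.

```shell
#!/usr/bin/env bash
# Suggest a swap size following the rule of thumb above: RAM plus roughly 0.5 GB.
# suggest_swap_mib is a hypothetical helper name, not part of any Arch tool.
suggest_swap_mib() {
    local ram_mib=$1
    echo $(( ram_mib + 512 ))
}

# Read the RAM size from /proc/meminfo (reported in KiB); fall back to 0 if unavailable.
ram_kib=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo 2>/dev/null || echo 0)
ram_mib=$(( ram_kib / 1024 ))
echo "RAM: ${ram_mib} MiB, suggested swap: $(suggest_swap_mib "$ram_mib") MiB"
```

Note that MemTotal is usually slightly below the nominal RAM size, so the printed value is a lower bound rather than an exact recommendation.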

2. Installing the package manager

Currently the best wrapper around pacman is trizen. Here is a list (AUR helpers - ArchWiki) comparing alternatives (scroll to the bottom).

Install trizen:

git clone https://aur.archlinux.org/trizen.git
cd trizen
makepkg -si

In ~/.config/trizen/trizen.conf set no_edit => 1. This setting saves you the confirmation prompt asking whether you want to edit a PKGBUILD (i.e. the script that specifies how to download and install a package). Until v1.55 there was also a switch that skipped the prompt asking whether you really want to install the package (install_built_with_noconfirm => 1). Unfortunately it was removed in v1.56, so you now need to pass --noconfirm to trizen commands.

2.1 Choose your shell

Most Linux systems come with bash (Bourne-again shell) as the default shell. While this shell is not bad, there are better alternatives. You have to decide whether investing the time to change the shell and learn the new syntax will make a difference for you. Since you will (hopefully) use the terminal quite often (from now on), I would recommend at least trying it :) My current favorite is the fish shell.

2.1.1 zsh

The zsh (Z-shell) is highly customizable but its settings are a bit complicated. It has several advantages over bash (file globbing, visual appearance, etc.).

A good zsh helper is prezto (GitHub - sorin-ionescu/prezto: The configuration framework for Zsh). Install zsh (trizen zsh) and start it (zsh). My favorite theme is agnoster.

2.1.2 fish

The fish shell is similar to zsh but comes with better defaults and an easier syntax. The omf package manager is great for installing additional plugins that simplify the shell usage. Check Oh-my-fish for an introduction.

The shell is available in the standard repos and can be installed with trizen fish.

A great way to get started is to call fish_config in a fish session to configure fish to your needs in a graphical browser window.

My current theme is bobthefish.

2.2 Enabling parallel compiling

Compiling packages from source can take time. To speed up the process with parallel compiling, set the MAKEFLAGS variable in /etc/makepkg.conf: MAKEFLAGS="-j$(nproc)". This will use all available cores on your machine for compiling.

(This option seems to be enabled by default now. However, it is still good to verify this.)
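One way to verify the setting is to print the active line from the config file. The following is a minimal sketch: show_makeflags is a hypothetical helper name, and it only looks for an uncommented MAKEFLAGS= line.

```shell
#!/usr/bin/env bash
# Print the active (uncommented) MAKEFLAGS line of a makepkg configuration file.
# show_makeflags is a hypothetical helper; the path defaults to /etc/makepkg.conf.
show_makeflags() {
    local conf=${1:-/etc/makepkg.conf}
    grep -E '^[[:space:]]*MAKEFLAGS=' "$conf"
}

if [ -r /etc/makepkg.conf ]; then
    show_makeflags || echo "MAKEFLAGS is not set (or is commented out)"
fi
echo "Available cores: $(nproc)"
```

If the helper prints nothing, the line is probably still commented out with a leading #.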

4. R

4.1 General

ccache

ccache caches compiled C code, making recurring installations of packages that use C code a lot faster.

Put the following into ~/.R/Makevars (create it if missing):

VER=
CCACHE=ccache
CC=$(CCACHE) gcc$(VER)
CXX=$(CCACHE) g++$(VER)
CXX11=$(CCACHE) g++$(VER)
CXX14=$(CCACHE) g++$(VER)
FC=$(CCACHE) gfortran$(VER)
F77=$(CCACHE) gfortran$(VER)

Additionally, install ccache on your system: trizen ccache. See this blog post by Dirk Eddelbuettel as a reference.

To use R from the shell without a prior defined mirror, you need the system libraries tcl and tk to launch the mirror selection popup (trizen -S tcl tk).

4.2 R & RStudio

R with optimized OpenBLAS / LAPACK

Next, install either the “Intel MKL” or libopenblas to be used in favor of the standard “libRlapack/libRblas” libraries that are shipped with the default R installation. These libraries are responsible for numerical computations and have impressive speedups compared to the default libraries. Thanks [@marcosci](https://github.com/marcosci) for the hint. While the “Intel MKL” library is the fastest according to the benchmarks, it is also much more complicated to install.

libopenblas will automatically be used if it is installed since the default R installation on Arch is configured with the --with-blas option (see section A.3.1 in https://cran.r-project.org/doc/manuals/r-release/R-admin.html#Installation). I recommend installing the AUR package openblas-lapack as it is a package combining multiple libraries: trizen -S --noconfirm openblas-lapack.

To verify your installation in R, simply run sessionInfo() and check the printed information:

sessionInfo()

R version 3.4.4 (2018-03-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Arch Linux

Matrix products: default
BLAS/LAPACK: /usr/lib/libopenblas_haswellp-r0.2.20.so

If you want to try out the “Intel-MKL” library, follow these instructions:

There is an AUR package named r-mkl that provides R compiled with intel-mkl. Note: The download size of intel-mkl is around 4 GB and it takes a lot of memory during installation. Most of it will be stored in the swap (around 10 GB) so make sure your swap space is > 10 GB.

Also, to successfully install intel-mkl, you need to temporarily increase the size of the /tmp directory as intel-mkl requires quite some space: sudo mount -o remount,size=30G,noatime /tmp.
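To see whether the remount took effect, the total size of /tmp can be checked before and after. This is a small sketch: fs_size_gib is a hypothetical helper name, and it relies on the GNU df shipped with Arch.

```shell
#!/usr/bin/env bash
# Print the total size, in whole GiB, of the filesystem that holds a path.
# fs_size_gib is a hypothetical helper; --output is a GNU df option.
fs_size_gib() {
    df --output=size -BG "$1" | tail -n 1 | tr -dc '0-9'
}

echo "/tmp currently holds $(fs_size_gib /tmp) GiB in total"
```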

RStudio

Use trizen -s rstudio and pick your favorite release channel. During installation R will get installed as a dependency (if you have not already done so).

4.3 Packages

Open RStudio and install the R package usethis (it will install quite a few dependencies, get a coffee) and then call usethis::browse_github_pat(). Follow the instructions to set up a valid GITHUB_PAT environment variable that will be used for installing packages from Github.

4.3.1 Task view “Spatial”

Of course, it is not required to install all packages of a task view, and you will never use all of them. In my opinion, however, it is pretty neat to have one command that installs (almost) all packages I use in a certain field. I do not mind the additional packages installed.

Required system libraries:

  • jq
  • fortran
  • v8-3.14 (some R packages (geojsonlite, etc.) require the V8 package, which depends on the outdated v8-3.14 library)
  • tk
  • nlopt
  • gsl

trizen -S jq gcc-fortran tk nlopt gsl

trizen -s nlopt

trizen -s v8-3.14

For rJava we need to do sudo R CMD javareconf.

Now you can install the ctv package and then call ctv::install.views("Spatial"). This will install all packages listed in the spatial task view.

Packages that error during installation (Please report back if you have a working solution):

  • ProbitSpatial
  • spaMM
  • RPyGeo (Windows only)

4.3.2 Task view “Machine Learning”

Required system libraries:

  • nlopt

Packages that error during installation (Please report back if you have a working solution):

  • interval (requires Icens from Bioconductor)
  • LTRCtrees (requires Icens from Bioconductor)

4.4 Git* repos

The easiest way (in my opinion) is to use SSH and usethis::create_from_github().

4.4.1 SSH configuration

If you have never created an SSH key pair on your local machine, please do so by calling ssh-keygen -t rsa.

If you already have a file named id_rsa.pub in your ~/.ssh folder on your local machine, skip this step! Otherwise it will overwrite your existing key and may invalidate SSH connections you set up previously. You now have an id_rsa.pub file in a (hidden!) folder named .ssh within your home directory (on your local machine). (You can enable viewing hidden files/folders in the file manager with the shortcut ALT + . (Dolphin) or CTRL + h (Nautilus)).

Next, make sure that the permissions of the ssh files are correct:

  1. The local directory in which you want to store your GitHub repos should have 777 permissions, which is usually not the case for a freshly created directory. If the permissions are wrong, usethis::create_from_github() will not be able to write files there: sudo chmod 777 ~/git.
  2. Make sure your SSH keys have the right permissions: sudo chmod 600 ~/.ssh/id_rsa, sudo chmod 644 ~/.ssh/id_rsa.pub
  3. Add your SSH key to the agent: ssh-add ~/.ssh/id_rsa
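The key permissions from step 2 can be checked with a short script before changing anything. This is a sketch, not a required step: perm_of and check_perm are hypothetical helper names, and stat -c is GNU coreutils syntax.

```shell
#!/usr/bin/env bash
# Report the octal permission bits of a file or directory.
# perm_of and check_perm are hypothetical helper names; stat -c is GNU coreutils syntax.
perm_of() {
    stat -c '%a' "$1"
}

# Compare the actual permissions of a path against the expected octal mode.
check_perm() {
    local path=$1 expected=$2
    if [ ! -e "$path" ]; then
        echo "missing: $path"
    elif [ "$(perm_of "$path")" = "$expected" ]; then
        echo "ok: $path ($expected)"
    else
        echo "fix: $path is $(perm_of "$path"), expected $expected"
    fi
}

check_perm "$HOME/.ssh/id_rsa" 600
check_perm "$HOME/.ssh/id_rsa.pub" 644
```

Any line starting with "fix:" points to a file whose mode differs from the values listed in step 2.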

Sometimes, the “ssh agent” is not initialized when starting a new shell. You can force this behavior by putting the following either into your ~/.bash_profile (if you are using the bash shell)

if [ -f ~/.ssh/agent.env ] ; then
    . ~/.ssh/agent.env > /dev/null
    if ! kill -0 $SSH_AGENT_PID > /dev/null 2>&1; then
        echo "Stale agent file found. Spawning new agent… "
        eval `ssh-agent | tee ~/.ssh/agent.env`
        ssh-add
    fi
else
    echo "Starting ssh-agent"
    eval `ssh-agent | tee ~/.ssh/agent.env`
    ssh-add
fi

or into ~/.config/fish/config.fish (for the fish shell):

# SSH AGENT
setenv SSH_ENV $HOME/.ssh/environment

function start_agent                                                                                                                                                                    
    echo "Initializing new SSH agent ..."
    ssh-agent -c | sed 's/^echo/#echo/' > $SSH_ENV
    echo "succeeded"
    chmod 600 $SSH_ENV 
    . $SSH_ENV > /dev/null
    ssh-add
end

function test_identities                                                                                                                                                                
    ssh-add -l | grep "The agent has no identities" > /dev/null
    if [ $status -eq 0 ]
        ssh-add
        if [ $status -eq 2 ]
            start_agent
        end
    end
end

if [ -n "$SSH_AGENT_PID" ] 
    ps -ef | grep $SSH_AGENT_PID | grep ssh-agent > /dev/null
    if [ $status -eq 0 ]
        test_identities
    end  
else
    if [ -f $SSH_ENV ]
        . $SSH_ENV > /dev/null
    end  
    ps -ef | grep $SSH_AGENT_PID | grep -v grep | grep ssh-agent > /dev/null
    if [ $status -eq 0 ]
        test_identities
    else 
        start_agent
    end  
end

You can also hand over the information manually if it does not work at all:

cred <- git2r::cred_ssh_key(publickey = "~/.ssh/id_rsa.pub", privatekey = "~/.ssh/id_rsa")

This object is then passed to the credentials argument in create_from_github().

Now clone all your repos from GitHub, e.g. create_from_github(repo = "pat-s/oddsratio", destdir = "~/git", credentials = cred). Alternatively, you can also check if git2r::cred_ssh_key() returns the correct credentials. If it returns

git2r::cred_ssh_key()
$publickey
[1] "/home/pjs/.ssh/id_rsa.pub"

$privatekey
[1] "/home/pjs/.ssh/id_rsa"

$passphrase
character(0)

attr(,"class")
[1] "cred_ssh_key"

you can also use create_from_github(repo = "pat-s/oddsratio", destdir = "~/git", credentials = git2r::cred_ssh_key()).

The little overhead is really worth it: you have a working SSH setup, and by reusing the command and just replacing the repo name, the cloning of all your repos is done within minutes!

4.5 R from the command line

While most people use R from within RStudio, it is important to have a proper command line setup. I often call R in a second session (besides RStudio) to update packages, run R CMD check on a package, etc. The native R command line that you get when typing R in your shell lacks a lot of features. Fortunately, there is radian. Its advantages are listed in the README of the repo.

I’ve set an alias that maps r to radian. So whenever I type r into the console and hit enter, I get a “21st century ready” R console.

trizen -S --noconfirm radian

4.6 Rprofile

The ~/.Rprofile holds several options that will be applied during R startup. However, specifying all custom functions, options and other calls in one R file can get messy. Fortunately, there is the startup package. You can put several .R files into the ~/.Rprofile.d directory. This way you can organize your custom R startup better. Also, by running startup::startup(debug = TRUE) you can actually see what happens when you start R.

You can find my settings in my Dropbox.

5. Accessing remote servers

5.1 File access (file manager)

5.1.1 fstab

There are multiple approaches to achieve this (see “Auto-mount network shares (cifs, sshfs, nfs) on-demand using autofs” by Patrick Schratz and “fstab” on the ArchWiki).

Here is an example of a fstab setup for a sshfs (to Linux server) and cifs (to Windows server) mount. Append those lines to /etc/fstab; don’t overwrite the existing content as this will result in boot errors otherwise!

# sshfs
sshfs#<username>@<ip>:<remote mount point> <local mount point> fuse        reconnect,idmap=user,transform_symlinks,identityFile=~/.ssh/id_rsa,allow_other,cache=yes,kernel_cache,compression=no,default_permissions,uid=1000,gid=100,umask=0,_netdev,x-systemd.after=network-online.target   0 0

# cifs
//<ip>/<remote mount point> <local mount point> cifs        credentials=/etc/.smbcredentials.txt,uid=1000,file_mode=0775,dir_mode=0775,gid=100,sec=ntlm,vers=1.0,dom=ads.uni-jena.de,forcegid,_netdev,x-systemd.after=network-online.target 0 0

Notes:

  • (cifs) Depending how new the Windows server is, you do not need vers=1.0.
  • (cifs) Store your login credentials for the windows server in a file, e.g. /etc/.smbcredentials.txt with contents being username = <username> and password = <password>.
  • (sshfs) Copy .ssh/id_rsa to /root/.ssh/ as the mount will be executed by the root user.
  • (cifs) Install the Arch Linux kernel headers for the cifs package to work (and later on for Virtualbox): trizen linux-headers
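The credentials file from the notes above could be created along these lines. This is a sketch under assumptions: write_smb_credentials is a hypothetical helper name, the key=value layout without spaces follows the mount.cifs manual page, and restricting the file to mode 600 is an added precaution rather than something the notes require.

```shell
#!/usr/bin/env bash
# Write a cifs credentials file and restrict it to its owner.
# write_smb_credentials is a hypothetical helper; the three arguments are placeholders.
write_smb_credentials() {
    local file=$1 user=$2 pass=$3
    # Create the file with a restrictive umask so the password is never world-readable.
    ( umask 077; printf 'username=%s\npassword=%s\n' "$user" "$pass" > "$file" )
    chmod 600 "$file"
}

# Example (run as root for a file under /etc):
# write_smb_credentials /etc/.smbcredentials.txt '<username>' '<password>'
```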

Reboot.

Both the advantage and the disadvantage of using fstab is that it tries to mount the directories during boot. However, this often fails, either because the network connection is not yet available at this point or because you need a VPN to access a server remotely.

5.1.2 Manual mount

You can also put this information in a different file and do a manual mount when everything is ready (i.e. your machine is booted, your VPN connected). The syntax looks a bit different then:

sudo sshfs -o reconnect,idmap=user,transform_symlinks,identityFile=~/.ssh/id_rsa,allow_other,cache=yes,kernel_cache,compression=no,default_permissions,uid=1000,gid=100,umask=0 <username>@<server ip>:/ <local mount point>

sudo mount -t cifs -o credentials=<location of credentials file>,uid=1000,file_mode=0775,dir_mode=0775,gid=100,sec=ntlm,vers=1.0,dom=<domain name>,forcegid //<server ip>/<shared folder>     <local mount point>

I won’t go into details of all options I used here. Check out the manual pages of the respective protocols if you are facing errors.

5.1.3 Executing the mount

If you use fstab, you can mount all mounts with sudo mount -a.

The manual approach needs to be saved in a bash script and called from your shell with bash <filename>.sh.

To avoid conflicts when remounting (after a network disconnect or similar), my wrapper script looks as follows:

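#!/bin/bash

# kill any hanging sshfs processes
sudo pkill -kill -f "sshfs"
# force-unmount, then lazily unmount, the sshfs mount point
sudo umount -f /mnt/<name>
sudo umount -l /mnt/<name>
# lazily unmount all cifs mounts
sudo umount -a -t cifs -l
# mount everything again
sudo bash <name of mount file>.sh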

Some operations in this file may be redundant or ineffective.

So in summary, I call my wrapper script which does the following:

  1. Unmount all mounts
  2. Mount all mounts specified in <name of mount file>.sh.
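A slightly more defensive variant of the unmount step checks whether something is mounted before calling umount. This is only a sketch: safe_umount is a hypothetical helper name, and mountpoint is assumed to be available (it ships with util-linux).

```shell
#!/usr/bin/env bash
# Unmount a directory only if something is currently mounted there.
# safe_umount is a hypothetical helper; mountpoint ships with util-linux.
safe_umount() {
    local dir=$1
    if mountpoint -q "$dir"; then
        sudo umount -l "$dir"
    else
        echo "not mounted: $dir"
    fi
}

# Example: safe_umount /mnt/<name>
```

This avoids the error messages that umount prints when the target is not mounted.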

5.2 Command-line access (Terminal)

5.2.1 SSH setup

Again we use SSH, this time to log into remote servers rather than downloading GitHub repos. If you have never created an SSH key pair on your local machine, please do so by calling ssh-keygen -t rsa.

If you already have a file named id_rsa.pub in your ~/.ssh folder on your local machine, skip this step! Otherwise it will overwrite your existing key and may invalidate SSH connections you set up previously. You now have an id_rsa.pub file in a (hidden!) folder named .ssh within your home directory (on your local machine). Now you need to add the contents of this file (id_rsa.pub) to ~/.ssh/authorized_keys on the server so that you can be identified:

ssh <username>@<server ip> 'test -d ~/.ssh || mkdir ~/.ssh' # creates the .ssh directory if it does not exist
cat ~/.ssh/id_rsa.pub | ssh <username>@<server ip> 'cat >> ~/.ssh/authorized_keys' # appends your local public key to the server's authorized keys

From now on, you will no longer be prompted for your password when you log in via the command line.

You can further simplify the login process. Instead of having to type ssh <user>@<server> you can store all of this information in your ~/.ssh/config file:

Host <name-you-wanna-use>
    user <username>
    Hostname <server>
    IdentityFile /path/to/.ssh/id_rsa

You can easily connect to all servers you have access to with a little one-time effort.
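Adding such entries can be scripted if you have many servers. The following is a minimal sketch: add_ssh_host is a hypothetical helper name, and the arguments in the example are placeholders.

```shell
#!/usr/bin/env bash
# Append a host entry to an SSH client config file.
# add_ssh_host is a hypothetical helper; name, user and server are placeholders.
add_ssh_host() {
    local config=$1 name=$2 user=$3 server=$4
    cat >> "$config" <<EOF
Host $name
    User $user
    Hostname $server
    IdentityFile ~/.ssh/id_rsa
EOF
}

# Example: add_ssh_host ~/.ssh/config <name-you-wanna-use> <username> <server>
```

Afterwards, ssh <name-you-wanna-use> is enough to connect.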

5.2.2 Using tmux and tmuxinator for terminal automation

tmux is a terminal multiplexer that lets you create complex terminal arrangements. Additionally, you can use keybindings to quickly move between panes, and a lot of extensions exist to save and restore layouts.

One extension, tmuxinator, gives you the power to write template files for server connections. This makes it possible to load several server connections with just one command.

For example, I currently have connections to six different servers stored in my config file. Executing tmuxinator start servers opens six windows, one per server, with three panes each. Here is a screenshot:

7. Laptop battery life optimization

Although the Linux kernel has a lot of power saving options, they are not all enabled by default.

There are two main power optimization tools:

  • Powertop
  • TLP

I prefer tlp as powertop often causes trouble with USB devices going into sleep mode. Also, applying the changes on boot is easier with tlp.

Do trizen -S --noconfirm tlp and then follow the instructions on TLP - ArchWiki to configure it correctly. powertop though is useful to check the applied settings. Do sudo powertop and go to the “tunables” section and check if most settings are “GOOD” (most are “BAD” before applying tlp).

8. Miscellaneous

8.1 Backup your config files and settings

A machine can crash any time. It is not only important to back up your data and scripts (or to have them stored in the cloud); you also want to back up all configurations of your apps so that you can restore them easily with a click. This is also important and useful if you want to sync all your configurations across multiple machines.

mackup is a great tool for this. It syncs a variety of config files and uploads these to a cloud of your choice. Under the hood, the config files will get soft-linked to your cloud which means that updating it on one machine will also trigger a change on all other machines.

On a new machine, you only need to run mackup restore to have all your settings back. Unfortunately it does not work with Windows but if you are reading this guide, you are most likely not on Windows anymore ;-).

8.1 arara

arara (GitHub - cereda/arara) is a TeX automation tool based on rules and directives: pac install arara-git. However, lately I use the latex-workshop extension in Visual Studio Code for all my LaTeX stuff.

8.2 latexindent.pl: Required perl modules

latexindent is a Perl script which automatically indents your LaTeX document during compilation: GitHub - cmhughes/latexindent.pl

trizen -S perl-log-dispatch perl-dbix-log4perl perl-file-homedir perl-unicode-linebreak

It’s also integrated in the latex-workshop extension in Visual Studio Code.

8.3 Editor schemes

I use the Dracula scheme in almost all applications. While it comes integrated into RStudio, here are installation instructions for Kate and Tilix. Alternatively, I have come to like the tomorrow-night-eighties theme lately.

8.4 Fonts

I enjoy using Fira Code. I use it as a coding font in all editors (monospace ftw) but also as a system wide font (the “medium” variant) with size 10. Another great monospace coding font is Iosevka.

8.5 Icon themes

There are two awesome icon themes: Papirus and numix.

Try them and choose for yourself. You will see what a tremendous impact good icons can have on your daily work.

8.6 Desktop Themes

8.6.1 KDE

My overall desktop theme favorite is “Adapta”. Set it via “System Settings -> Workspace theme -> Desktop Theme”. For “Look and Feel” I prefer “Arc Dark”.

To install these, simply click on “Get new looks” on the bottom right when you are in “System Settings -> Workspace theme”.

8.6.2 GNOME

KDE apps in GNOME (and the other way round) usually have an odd appearance because they are powered by different graphical libraries. To make KDE apps look acceptable in GNOME, do the following:

  1. Install qt5ct
  2. Set the environment variable QT_QPA_PLATFORMTHEME to "qt5ct" (add set -gx QT_QPA_PLATFORMTHEME "qt5ct" in .config/fish/config.fish).
  3. Run qt5ct and change the settings to your liking. Note: The default GNOME font is “Cantarell Regular 11pt”. To select the “Adwaita” theme (the GNOME default), you need to install it first: trizen -s adwaita-qt5.

Now you can enjoy KDE apps such as Dolphin or Okular. However, you have to start them from the command line. A convenient workaround is to autostart them at boot. Create a file called .config/autostart/dolphin.desktop with the following content:

[Desktop Entry]
Name=dolphin
Comment=Run dolphin
Exec=dolphin
Terminal=false
Type=Application

8.7 Presentations

To create presentations I use the R package xaringan. Usually I convert the resulting HTML slides to PDF using decktape (install with trizen -s nodejs-decktape) and present the talk using impressive (install with trizen -s impressive).

8.8 Touchpad drivers

Some devices need the “synaptic touchpad drivers” to enable “soft-click” window activation: trizen -s xf86-input-libinput.

8.9 Other helpful tools

  • pacmanity: Each time you install a package, this tool adds the package to a Github Gist. This Gist stores all of your installed packages. Here’s mine.