R Basics: Weather Data

This tutorial relies on data provided as supplementary material for the paper Slingsby et al. 2017. Intensifying postfire weather and biological invasion drive species loss in a Mediterranean-type biodiversity hotspot. PNAS. http://dx.doi.org/10.1073/pnas.1619014114. The data can be downloaded here. It’s a ~13MB .xlsx file.

The study presents the results of 44 years of observation of permanent vegetation plots in the Cape of Good Hope Section of Table Mountain National Park, South Africa. The plots are dubbed “The Taylor Plots” in honour of Hugh Taylor, who established them in 1966. We’ll describe the details of the data as we work through the different tutorials, but for this first one we will only use the rainfall and temperature data for the reserve.

First, a few R basics

R can do basic calculations

1 + 1

## [1] 2

But to do more complex calculations, its easier to work with objects, which you assign, e.g.

x <- 1 
x

## [1] 1

You can assign objects from other objects - usualy once you’ve performed an operation on them

y <- x + 1
y

## [1] 2

You can perform operations with multiple objects

y + x

## [1] 3

Combine objects

z <- c(x, y)
z

## [1] 1 2

Or apply functions to objects

sum(z)

## [1] 3

You can also access elements within an object using indexing with square brackets

z[1] #the first element in vector z

## [1] 1

z[2] #the second element in vector z

## [1] 2

Or evaluate them using logical expressions

z == 1

## [1]  TRUE FALSE

z < 1

## [1] FALSE FALSE

z > 1

## [1] FALSE  TRUE

Note that you can reassign (or overwrite) objects, so you need to use unique object names if you want to keep them

x <- sum(z)
x

## [1] 3

R keeps these objects in memory

ls() #function to call the list of objects in memory - and "#" lets you put comments like this in your code...

## [1] "x" "y" "z"

This is useful, but if the objects are large, it will slow down the speed of calculations, so it is good practice to discard (large) objects once you’re done with them (or only read them in when you need to)

rm(x, y, z)
ls()

## character(0)

Now let’s explore real data!

Reading in data

First we’ll call the packages/libraries that contain the functions we need

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

## ✓ ggplot2 3.3.3     ✓ purrr   0.3.4
## ✓ tibble  3.1.0     ✓ dplyr   1.0.5
## ✓ tidyr   1.1.3     ✓ stringr 1.4.0
## ✓ readr   1.4.0     ✓ forcats 0.5.1

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(readxl)

Now we need some data.

When reading data from a file into R you need to know the system path to the file on your computer, e.g. "/home/jasper/GIT/academic-kickstart/static/datasets".

Typing the whole thing out (or even copying and pasting) can get tedious, so its often better to set your working directory so R knows where to look for files, like so.

setwd("/home/jasper/GIT/academic-kickstart/static/datasets")

If you are trying to run this code on your own, you’ll get an error when trying to set your working directory to a folder on my computer :) - You need to change the line of code above to the folder where you put the data. If you’re on Windows, you’ll need to specify the route drive, e.g. C:. You also need to make sure to use single forwardslashes / or double backslashes \\, rather than the single backslash \ that is the default on Windows.

So my file path on windows would look like:

"C:/jasper/GIT/academic-kickstart/static/datasets"

"C:\\jasper\\GIT\\academic-kickstart\\static\\datasets"

Or on Mac it would be:

"/Users/jasper/GIT/academic-kickstart/static/datasets"

Lastly, if you’ve tried all that and it’s still not working try adding (or removing) slash(es) at the end.

We can check if we got it right using

getwd()

## [1] "/home/jasper/GIT/academic-kickstart/static/datasets"

Ok, once all that’s done, we can use R to have a look at what files are in that directory

list.files()

## [1] "pnas.1619014114.sd01.xlsx"        "SAEON_field instrument list.xlsx"

And in this case we want to read data from pnas.1619014114.sd01.xlsx, but given that this is an excel workbook, it may have multiple separate worksheets of data. Let’s have a quick glimpse the lazy way

excel_sheets("pnas.1619014114.sd01.xlsx")

##  [1] "METADATA"           "weather"            "fires"             
##  [4] "postfireweather"    "enviroment"         "excluded_spp"      
##  [7] "veg1966"            "veg1996"            "veg2010"           
## [10] "traits"             "speciesclimatedata"

There are 11 different sheets. Let’s start with the weather

weather <- read_excel("pnas.1619014114.sd01.xlsx", sheet = "weather")

## New names:
## * `` -> ...1

#Let's have a look at the first few lines?
head(weather) #

## # A tibble: 6 x 6
##    ...1 Date                 Temp  Rain Month  Year
##   <dbl> <dttm>              <dbl> <dbl> <dbl> <dbl>
## 1     1 1961-01-02 00:00:00  21     0       1  1961
## 2     2 1961-01-03 00:00:00  20.6   0       1  1961
## 3     3 1961-01-04 00:00:00  15     3.5     1  1961
## 4     4 1961-01-05 00:00:00  21     0       1  1961
## 5     5 1961-01-06 00:00:00  19.8   0.8     1  1961
## 6     6 1961-01-07 00:00:00  22     0       1  1961

or last few lines

tail(weather)

## # A tibble: 6 x 6
##    ...1 Date                 Temp  Rain Month  Year
##   <dbl> <dttm>              <dbl> <dbl> <dbl> <dbl>
## 1 17474 2009-11-25 00:00:00  15       0    11  2009
## 2 17475 2009-11-26 00:00:00  18.9     0    11  2009
## 3 17476 2009-11-27 00:00:00  16.9     0    11  2009
## 4 17477 2009-11-28 00:00:00  22.2     0    11  2009
## 5 17478 2009-11-29 00:00:00  19.2     0    11  2009
## 6 17479 2009-11-30 00:00:00  17.3     0    11  2009

There are 17476 rows of data, so it’s a bit tricky to inspect it all! Fortunately we can call a summary like so

summary(weather)

##       ...1            Date                          Temp      
##  Min.   :    1   Min.   :1961-01-02 00:00:00   Min.   : 6.30  
##  1st Qu.: 4371   1st Qu.:1973-12-19 18:00:00   1st Qu.:15.00  
##  Median : 8740   Median :1985-12-09 12:00:00   Median :17.50  
##  Mean   : 8740   Mean   :1985-12-05 08:31:36   Mean   :17.62  
##  3rd Qu.:13109   3rd Qu.:1997-12-07 06:00:00   3rd Qu.:20.00  
##  Max.   :17479   Max.   :2009-11-30 00:00:00   Max.   :35.30  
##       Rain             Month             Year     
##  Min.   : 0.0000   Min.   : 1.000   Min.   :1961  
##  1st Qu.: 0.0000   1st Qu.: 4.000   1st Qu.:1973  
##  Median : 0.0000   Median : 7.000   Median :1985  
##  Mean   : 0.9914   Mean   : 6.514   Mean   :1985  
##  3rd Qu.: 0.2000   3rd Qu.:10.000   3rd Qu.:1997  
##  Max.   :97.0000   Max.   :12.000   Max.   :2009

Nice, but this doesn’t help see trends etc. It’s difficult to eyeball daily rainfall so let’s calculate the annual totals

Data wrangling

Here we use the group_by() and summarise() functions provided by library(dplyr) to calculate the annual rainfall totals using sum().

annualrain <- weather %>% group_by(Year) %>% summarise(Rainfall = sum(Rain))

head(annualrain) #to see what we get

## # A tibble: 6 x 2
##    Year Rainfall
##   <dbl>    <dbl>
## 1  1961     382 
## 2  1963     304.
## 3  1964     418.
## 4  1965     403.
## 5  1966     389 
## 6  1967     371.

If you were to translate this line of code into English it would say:

Take the weather dataframe, and for each Year, sum the total rainfall across all observation (months in this case).

library(dplyr) and other tidyverse packages contain a series of functions that can be thought of as verbs, providing the grammar for you develop code statements, e.g.:

arrange()
filter()
mutate()
slice()
summarise()
group_by()

(Pro tip: Do yourself a favour and apply the fourth law here!)

Binding the verbs together into code sentences is made possible by the pipe %>% function, which essentially allows you to pass the output of one function (or verb) to the next without making a new object in memory each time.

i.e. without %>% this code would have to be written

step1 <- group_by(weather, Year)
step2 <- summarise(step1, Rainfall = sum(Rain))
head(step2)

## # A tibble: 6 x 2
##    Year Rainfall
##   <dbl>    <dbl>
## 1  1961     382 
## 2  1963     304.
## 3  1964     418.
## 4  1965     403.
## 5  1966     389 
## 6  1967     371.

Which creates the unwanted objects like step1, which just waste memory, so let’s get rid of them

rm(step1, step2)

So looking at annual totals is easier to interpret than daily data, but not as good as a graph…

Plotting data with `library(ggplot2)`

To plot the data using the functions in library(ggplot2) we first describe the data, telling ggplot what data object we want to use and which variables should be our x and y axes, like so.

r <- ggplot(annualrain, aes(x = Year, y = Rainfall))

Then we can play with different plot types

r + geom_line() #A line graph - as easy as that!!!

I find time-series are often easier to look at if you shade one side, like so

r + geom_area()

Which is obviously easier to interpret than a scatterplot of points

r + geom_point()

Unless you add a trend line, like a loess spline

r + geom_point() + geom_smooth(method = 'loess')

## `geom_smooth()` using formula 'y ~ x'

So there have been spells of wetter and dryer years, but no clear trend. What about if we wanted to see if there’ve been any changes in seasonality? I.e. particular months show patterns of getting wetter or dryer?

More data wrangling and plotting

To summarise the data by month is very easy, all we need do is add Month to the group_by() statement in the code we used to get the annual totals.

monthlyrain <- weather %>% group_by(Year, Month) %>% summarise(Rainfall = sum(Rain))

## `summarise()` has grouped output by 'Year'. You can override using the `.groups` argument.

head(monthlyrain)

## # A tibble: 6 x 3
## # Groups:   Year [1]
##    Year Month Rainfall
##   <dbl> <dbl>    <dbl>
## 1  1961     1     44.7
## 2  1961     2      5.7
## 3  1961     3     11.9
## 4  1961     4      9.4
## 5  1961     5     45.2
## 6  1961     6     86.4

The only issue now is that it is difficult to interpret the table directly, and we have 3 variables, so it’s not as simple to plot.

Fortunately, library(ggplot2) has the functions facet_grid() and facet_wrap() that make it very easy to make multi-panel plots. Here we use the same code as the line plot above, but apply it to the object monthlyrain and add facet_grid() indicating that the rows of the mutiplot should be the months.

r <- ggplot(monthlyrain, aes(x = Year, y = Rainfall))

r + geom_line() + facet_grid(rows = vars(Month))

Nice! We can clearly see that months 4 to 8 (April to August) are the wetter months in this region, but its still difficult to tell which years were wetter or drier relative to the long term average. To do this we need to calculate the anomalies (i.e. the differences from the mean).

Getting the mean for each month is very easy to do by adapting our code above and applying it to the monthlyrain dataframe we have already calculated. We simply change the function passed to summarise() from sum() to mean(), and remove Year from the group_by() statement.

i.e we’re saying

Take the monthlyrain dataframe, and for each Month, calculate the average rainfall across all years.

like so

monthlymean <- monthlyrain %>% group_by(Month) %>% summarise(MeanMonthlyRainfall = mean(Rainfall))

head(monthlymean)

## # A tibble: 6 x 2
##   Month MeanMonthlyRainfall
##   <dbl>               <dbl>
## 1     1                11.4
## 2     2                13.7
## 3     3                13.4
## 4     4                32.6
## 5     5                47.6
## 6     6                58.7

Ok, but now we need to match the MeanMonthlyRainfall with each month in monthlyrain. This is very easy with the function merge().

monthlyrain <- merge(monthlyrain, monthlymean)

head(monthlyrain)

##   Month Year Rainfall MeanMonthlyRainfall
## 1     1 1961     44.7              11.425
## 2     1 1999      5.2              11.425
## 3     1 1989      6.4              11.425
## 4     1 1979     14.8              11.425
## 5     1 2006      0.7              11.425
## 6     1 1996      2.3              11.425

Ok, now we want to calculate the positive and negative differences from the mean. For plotting purposes we’ll do this as separate Positive and Negative columns. R makes this easy, by allowing simple mathematical operations and allowing us to access or assign new columns using the $ operator, like so.

monthlyrain$Positive <- monthlyrain$Rainfall - monthlyrain$MeanMonthlyRainfall 

head(monthlyrain)

##   Month Year Rainfall MeanMonthlyRainfall Positive
## 1     1 1961     44.7              11.425   33.275
## 2     1 1999      5.2              11.425   -6.225
## 3     1 1989      6.4              11.425   -5.025
## 4     1 1979     14.8              11.425    3.375
## 5     1 2006      0.7              11.425  -10.725
## 6     1 1996      2.3              11.425   -9.125

Now we have negative values in our Positive column, but that’s ok, because we can set them to zero using a logical operator that detects negative values, like so:

monthlyrain$Positive[which(monthlyrain$Positive < 0)] <- 0

head(monthlyrain)

##   Month Year Rainfall MeanMonthlyRainfall Positive
## 1     1 1961     44.7              11.425   33.275
## 2     1 1999      5.2              11.425    0.000
## 3     1 1989      6.4              11.425    0.000
## 4     1 1979     14.8              11.425    3.375
## 5     1 2006      0.7              11.425    0.000
## 6     1 1996      2.3              11.425    0.000

Ok, there were a few steps that went into that. Let’s break it down:

the logical expression < returns TRUE or FALSE for whether the values in monthlyrain$Positive are less than zero or not;
the which() function translates all the values that are TRUE to their position in the vector/column monthlyrain$Positive;
the square brackets (or indexing) allows us to call all those values in the vector/column monthlyrain$Positive; and
assign them the value zero with <- 0

Now we do the same for the Negative values:

monthlyrain$Negative <- monthlyrain$Rainfall - monthlyrain$MeanMonthlyRainfall
monthlyrain$Negative[which(monthlyrain$Negative > 0)] <- 0

head(monthlyrain)

##   Month Year Rainfall MeanMonthlyRainfall Positive Negative
## 1     1 1961     44.7              11.425   33.275    0.000
## 2     1 1999      5.2              11.425    0.000   -6.225
## 3     1 1989      6.4              11.425    0.000   -5.025
## 4     1 1979     14.8              11.425    3.375    0.000
## 5     1 2006      0.7              11.425    0.000  -10.725
## 6     1 1996      2.3              11.425    0.000   -9.125

And we’re ready to start plotting. I’ll build this up piece by piece. First we’ll set up the plot and add the positive anomaly as a “ribbon” plot

g <- ggplot(monthlyrain, aes(x = Year))

g <- g + geom_ribbon(aes(ymin = 0, ymax = Positive, fill="Positive"))

g

Now we’ll add the negative anomaly ribbon

g <- g + geom_ribbon(aes(ymin = Negative, ymax = 0 ,fill="Negative"))

g

Next we can set colours to make sure they make sense - e.g. positive (wetter) anomalies should be blue and negative (drier) should be red.

g <- g + scale_fill_manual(name="",values=c("#D6604D", "#4393C3"))

g

Now we can split it into a multipanel by month using facet_grid()

g <- g + facet_grid(cols = vars(Month))

g

And add a title, meaningful y-axis label and flip the axes so we can read the graph properly

g <- g + ggtitle("Monthly Rainfall Anomaly (1961-2010)") + coord_flip() + ylab("Rainfall (mm)")

g

Not much to observe other than the positive anomalies are typically bigger than the negative anomalies. This is not surprising, because the negative anomalies are bounded by zero (i.e. no rain is the driest it can get), while there are no constraints on the positive anomaly. For example, the wettest day on record had 97 mm.

What about summer anomalies?

Now see if you can adapt the code above to develop the same figure using the Temp data in the object weather.

Note that you’ll need to calculate monthly means, not sum the totals, and better to swap the colours so red is the positive anomaly (hotter) and blue the negative (cooler).

It should look like this:

## `summarise()` has grouped output by 'Year'. You can override using the `.groups` argument.

This looks a bit more interesting, with fewer cool anomalies and more hot anomalies as we head towards the present. It turns out the temperature has risen by ~1.2 degrees over the observation period.

Last updated on May 11, 2021