Introduction to web scraping of judicial data in R
Creating a data set of UK Supreme Court decisions
The field of empirical legal studies holds great promise in its agenda to systematically analyse and scrutinise the work of courts. An initial step in this research tends to be data gathering, which can draw on a variety of sources. One such source is the web itself, accessed through "scraping": writing dedicated computer programs to download data from the internet and organise it systematically into data tables that lend themselves to empirical research.
Web scraping can be done using a variety of software and programming languages. Here I present a step-by-step guide to creating such a data set for the UK Supreme Court using the R programming language. My aim is to provide an entry point for scholars who wish to conduct empirical analysis of courts for which no published data is yet available. I would also encourage researchers to publish the data they collect, allowing for increasingly valuable comparative research into the work of courts.
If you are merely interested in accessing data about the UK Supreme Court, you can download the data directly from my website. The full script used to compile these data can be downloaded here.
While I have tried to write this guide in an accessible way, I cannot go into depth about the R programming language. If you are an absolute beginner and want to learn more, I recommend the free book R for Data Science by Hadley Wickham, Mine Çetinkaya-Rundel, and Garrett Grolemund.
This guide starts by providing a checklist before you start scraping data from online sources, before moving on to the example of the UK Supreme Court in section 2. This section begins by loading a simple table from the website into a data frame in R, before moving on to scraping and extracting more detailed case-level data. Section 3 moves on to data management, illustrating how raw variables from the web can be converted into more practical formats suited to data analysis. Section 4 provides some illustrations of how the data can be analysed, drawing on the gender of the justices at the Supreme Court. Section 5 concludes.
Table of contents:
- Checklist before starting
- Scraping the UK Supreme Court
  2.1. Download a list of cases
  2.2. Scrape case-level data
  2.3. Extracting case-level data
- Data management
  3.1. Converting dates
  3.2. Managing text data
  3.3. Aggregating data to judgment level
- Analysing data
- Conclusion
1. Checklist before starting
The first step of any data collection effort using web scraping is, of course, to identify which data to collect. In this example I turn to the website of the Supreme Court of the United Kingdom, which lists certain metadata for cases going back to 2009. A quick peek around the website shows that the information published by the UK Supreme Court became much richer from 2020 onwards, with recent decisions including text data (a summary of the facts and the issue), information about the parties to the case, and important dates. The justices responsible for each decision appear to have been listed consistently since the first entry in October 2009.
Having familiarised ourselves with the website of the Court, we turn to two important questions:
- Is the data readily available elsewhere?
- Is it legal to collect data from the website using web scraping?
In response to the first point, a quick web search for "UK supreme court data" returns the BAILII databases, which include the text of decisions from UK courts (and which are linked to by the UK Supreme Court itself), and judiciary.co.uk, which presents a user-friendly overview of court decisions. There does not, however, appear to be an easy way to download data from any of these sources. Although there might be another database published somewhere that I am unaware of, and one should probably do more thorough research before starting a serious data gathering effort, it appears this data is not easily available for download at the time of writing.
This brings us to the second question: Would it even be legal to download data from the UK Supreme Court?
Looking into the terms and conditions of the Court's website, it seems this would present no issue. The automatic indexing of the website is not clearly addressed, and regarding copyright the terms and conditions state that:
You may use and re-use Crown copyright material from this website, other than the Royal Arms and departmental or agency logos, under the terms of the Open Government Licence, provided it is reproduced accurately and not in a misleading context.
Researchers should therefore be able to not only download data from the UK Supreme Court, but also to publish it in an academic context.
2. Scraping the UK Supreme Court using R
While all one would really need in order to follow this guide is a working installation of R, most R users prefer working within RStudio. I would recommend anyone new to R to do the same, unless you have strong preferences for other software. Once you have a piece of code open in RStudio you can run it by highlighting it and pressing ctrl+enter. Lines beginning with hashtags (#) are comments, and will not be executed.
We begin by loading the packages I use in this guide: rvest, stringr, ggplot2, and tibble. If this is your first time using these packages, you will need to install them first using the install.packages command.
# # Uncomment the following lines and run them to install packages:
# install.packages("rvest") # web scraping
# install.packages("stringr") # editing strings
# install.packages("ggplot2") # creating graphs
# install.packages("tibble") # user friendly data frames
library(rvest)
library(stringr)
library(ggplot2)
library(tibble)
Having successfully installed R and loaded these packages, we are ready to begin gathering data!
This guide was written in 2024. A problem with web scraping, or really the web in general, is that it is subject to change: the code below might very well stop working if the UK Supreme Court should decide to change its website. The wonderful Internet Archive provides a snapshot of the website of the UK Supreme Court at the time of writing, which might serve as a useful reference point should this ever happen.
2.1. Download a list of cases
When writing a web scraping script we have to instruct the computer to read the website much in the same way as we would as human readers. In this case this involves first observing the list of all cases, identifying the relevant links provided within it, and then following these links in order to retrieve more information. The cases from 2023 are located at https://www.supremecourt.uk/decided-cases/2023.html, providing a natural starting point:
url <- "https://www.supremecourt.uk/decided-cases/2023.html"
page <- read_html(url)
With these two lines of code, we have now downloaded this site into an R object, similar to how a web browser would load it if you opened it normally. Running message(page) in R will return the exact same output in your R console as opening the URL in a web browser, right-clicking anywhere, and choosing "view source". In most web browsers you should be able to see this directly by entering view-source:https://www.supremecourt.uk/decided-cases/2023.html in the URL field.
When preparing to retrieve information, we want to familiarise ourselves with the website as it looks in its source code. A useful trick is to find a unique piece of text in the graphical view and search for it (ctrl+f) in the source code in order to quickly find its location.
The source of the website is stored as an XML document, which allows us to use a variety of powerful tools for querying and extracting its content. In this guide I use XPath, a computer language designed to work with these files. You can read more about XPath on Wikipedia, or familiarise yourself with it using a cheatsheet from devhints.io.
In our case, the Court presents its list of cases within a table (tagged in HTML as <table>), providing a good starting point for us to identify its content. This also happens to be the only table presented on the site, making it easy to identify using the html_elements function from the rvest package. Our job is made even easier by the html_table function, which allows us to load an HTML table directly into a data frame in R after isolating its source code.
# extract all <table> tags observed within the document
table_code <- html_elements(page, xpath="//table")
# If there is more than one table, this produces a list of table
# elements.
# If we want to see the source code of table in the console, we can
# run the following:
message(table_code)
# We can now use html_table to read the table from source,
# extracting only the first returned element:
table <- html_table(table_code)[[1]]
The table element is now a perfectly useful R data frame of 54 rows and four columns, containing the date, citation, case ID, and case name, similar to what is presented on the Court's website. That's already good, but of course we would like the data to get some more flesh on its bones.
An observant reader will already have noticed a challenge in the data we just downloaded: while each row of the table refers to a single judgment of the Court, each judgment might correspond to several cases. Judgment UKSC 42 of 15 November 2023, for example, was written in response to six separate cases, with separate metadata for each one.
It is therefore necessary to separate out metadata on the case level. We do this by first transforming the "Case ID" column into a machine-readable list.
# Transform case ID variable to be a proper list
table$ID_list <- table$`Case ID` |>
str_replace_all("(.\\S)\\s*(UKSC)", "\\1&\\2") |> # Divide by & when there are multiple cases
str_split("&") # Split at &
Managing multi-level data and unique ID codes can be a painful exercise. Readers who attempt to expand the data on the UK Supreme Court beyond 2023 will encounter observations where a single judgment appears to have been given multiple ID codes, and observations where cases have simply been assigned the wrong ID by the Court.
Striving to handle ID codes in a sensible and organised way is always worth the effort: unique ID codes for each observation form the very foundation of your data.
A second challenge is that some "neutral citations", notably [2023] UKSC 11 and [2023] UKSC 5, appear twice. In each case the duplicated rows appear to refer to the same judgment, so I assume it is safe to merge them. I store this new data frame, with duplicated observations removed, in a new object referred to as judgments.
# Remove all duplicated rows and create "judgments" data frame
judgments <- table[!duplicated(table$`Neutral citation`),]
# Insert case numbers from table data frame to prevent loss of data
# in duplicated cases
judgments$ID_list <-
lapply(judgments$`Neutral citation`, function(y)
unique(unlist(table$ID_list[which(table$`Neutral citation` == y)]))
)
# Now that we've streamlined the data, the Case ID column is no
# longer accurate. We can update it:
judgments$`Case ID` <- sapply(judgments$ID_list, function(y) paste(y, collapse=", "))
These sorts of quirks will of course vary from tribunal to tribunal, and the correct solution to such problems tends to be a question of substance rather than of methodology.
Now that we have a list of case numbers and their associated judgments, we are ready to compile a data set on the case level.
# Create a "flat" list of all case numbers, as opposed to the
# current "list of lists"
list_of_cases <- unlist(judgments$ID_list)
# Create a data frame on the case level
# I include a few variables I know I want to fill in later on.
cases <- tibble(id = list_of_cases,
citation = NA,
summary_issue = NA,
summary_facts = NA)
# Loop through all cases and add their "neutral citation" from the
# original table
for(i in cases$id){
cases$citation[which(cases$id == i)] <-
judgments$`Neutral citation`[grep(i, judgments$`Case ID`)]
}
Having prepared a case-level data set, we can now go on to collect data on the level of each individual case.
2.2. Scrape case-level data
When presenting its case-level data, the Court is helpful enough to build its URLs upon case IDs. As a result, we do not have to extract the links from the source code of the Court's website; we can instead reconstruct them based on ID numbers. For example, the URL for case 2022/0149 is https://www.supremecourt.uk/cases/uksc-2022-0149.html.
In order to scrape the Court's information on the case level, we first identify the base of the URL (https://www.supremecourt.uk/cases/). We then reconstruct the part that varies (uksc-2022-0149) based on case numbers.
url_base <- "https://www.supremecourt.uk/cases/"
cases$web_id <- cases$id |>
str_replace_all("\\s|/", "-") |> # Replace spaces (\\s) or (|) slash (/) with dashes
tolower() # make lowercase
We now have what we need in order to reconstruct the URLs to all cases in the data frame. This means we are ready to scrape case-level data.
Normally, multiple web pages would be scraped by running a loop. A demo of how this works is presented below, printing the URL of each iteration into the console:
for(i in cases$web_id){
# Define URL:
url <- paste0(url_base, i, ".html")
# Return a message of "web id" and the URL in the console, to
# verify that the loop is running:
message(i, ": ", url)
# Sleep for 1/10th of a second in order to illustrate how the
# loop gradually works through all the cases.
Sys.sleep(.1)
}
The code presented in the rest of this section can fruitfully be placed within a similar loop. Similarly, all the code in this guide since the initial definition of the URL could have been placed within a loop, covering not only 2023 but all years in which the Court has published data. This would look something like the following:
for(y in 2009:as.numeric(format(Sys.Date(), "%Y"))){
url <- paste0("https://www.supremecourt.uk/decided-cases/", y, ".html")
# page <- read_html(url) # Commented out to avoid unnecessary scraping
# # the remainder of the guide included here, making sure
# # not to override data when scraping a new year
Sys.sleep(2) # Take a break between each iteration of the loop
}
For the sake of simplicity I omit the loop in this guide and instead define i as the first observed value in cases$web_id. This can be a useful way of writing the code, before wrapping it in a loop later once the code has been found to work with multiple potential values of i.
Having i defined as the web ID that constitutes a URL together with the base and .html, we can scrape the website in the same way as earlier.
# I define i as the web ID of the first row in the cases data set
i <- cases$web_id[1]
# You would normally begin the loop here instead:
# for(i in cases$web_id){
# We create a full URL by pasting together the base, ID number, and "html"
url <- paste0(url_base, i, ".html")
# and then we read it into object "page":
page <- read_html(url)
# We also define an object x, which refers to the row in the cases
# data set corresponding to the current observation. As long as i
# is defined as the first value in cases$web_id, x will always be 1.
x <- which(cases$web_id == i)
The complexity of the case-level data is, however, greater than that of the judgment-level table we scraped in section 2.1. When working on this, it is necessary to pay close attention to the source code of the page being scraped.
2.3. Extracting case-level data
First, we want to observe the title of the case. This can easily be achieved by looking at the <title> tag in the source code, in which the Supreme Court always lists the title of the case followed by " - The Supreme Court". We simply extract the text of the title and remove this ending, and we are left with the title of the case.
# case title
cases[x, "title"] <- page |>
html_elements(xpath="//title") |>
html_text() |>
str_remove("\\s?- The Supreme Court") # Sometimes the first space is missing
Second, we want to extract the text presented under the "Issue" headline. We see that the headline is presented in an <h4> tag, as <h4>Issue</h4>. The following headline is <h4>Facts</h4>, meaning that we want to extract the HTML elements appearing between these two headlines.
While there are many ways of achieving this, we can use XPath to help us identify all HTML tags preceded by the Issue headline and succeeded by the Facts one:
issue <- page |>
html_elements(xpath='//*[preceding-sibling::h4[. = \'Issue\']
and following-sibling::h4[. = \'Facts\']]')
In the above code, //* is used to refer to any HTML tag. If we want to specify that we are only interested in text paragraphs (<p>), we can replace this with //p. Similarly, if "Issue" and "Facts" could appear in other tags than <h4>, we can replace h4 with an asterisk (*). As mentioned earlier, devhints.io has a useful cheatsheet that helps write and make sense of XPaths.
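To make this concrete, a paragraph-only variant of the same query could look like the sketch below (assuming the headlines on the Court's pages remain in <h4> tags):
# Variant of the query above restricted to <p> tags between the two headlines
issue_paragraphs <- page |>
  html_elements(xpath='//p[preceding-sibling::h4[. = "Issue"]
                and following-sibling::h4[. = "Facts"]]')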
When looping this through all observations, we find that the issue is not retrieved for all judgments. The reason for this is that the headline can sometimes be "Issues" rather than "Issue". A quick and easy, if not very elegant, fix is to add an if-statement: if the length of issue (the previously retrieved object) is zero, meaning nothing was found, we look for the headline "Issues" instead.
# Sometimes the headline is "Issues"
if(length(issue) == 0){
issue <- page |>
html_elements(xpath='//*[preceding-sibling::h4[. = \'Issues\']
and following-sibling::h4[. = \'Facts\']]')
}
Having now retrieved the issues and checked that this works well for all observations, we can do the same with the facts. We face a similar challenge, where the headline following the facts could be either "Judgment appealed" or "Parties", depending on the observation. Again, an if-statement does the trick.
facts <- page |>
html_elements(xpath='//*[preceding-sibling::h4[. = \'Facts\']
and following-sibling::h4[. = \'Judgment appealed\']]')
if(length(facts) == 0){
facts <- page |>
html_elements(xpath='//*[preceding-sibling::h4[. = \'Facts\']
and following-sibling::h3[. = \'Parties\']]')
}
Note that the issues and facts are extracted as XML nodesets. In order to insert them into the cases data frame, we can convert their content into lists.
# Convert text variables into list objects:
cases[[x, "summary_issue"]] <- list(html_text(issue))
cases[[x, "summary_facts"]] <- list(html_text(facts))
Moving on, we're interested in the various metadata listed on the website: the justices, the dates of the hearing, and the date of the judgment. This information is presented inside paragraph tags (<p>), preceded by <h4> tags with the respective headlines. The justices can, for example, be listed as follows:
<h4>Justices</h4>
<p>Lord Reed, Lord Briggs, Lord Sales, Lord Burrows, Lord Stephens</p>
Below, we extract the first paragraph tag following each headline, containing the relevant metadata.
# Justices
cases[x, "justices"] <-
html_text(html_elements(page,
xpath='//p[preceding-sibling::h4[. = "Justices"]]')[1])[1]
# Hearing start date
cases[x, "date_hearing_start"] <-
html_text(html_elements(page,
xpath='//p[preceding-sibling::h4[. = "Hearing start date"]]')[1])[1]
# Hearing finish date
cases[x, "date_hearing_finish"] <-
html_text(html_elements(page,
xpath='//p[preceding-sibling::h4[. = "Hearing finish date"]]')[1])[1]
# Judgment date
cases[x, "date_judgment"] <-
html_text(html_elements(page,
xpath='//p[preceding-sibling::h4[. = "Judgment date"]]')[1])[1]
Moving on, we might also be interested in a direct link to the text of the judgment itself. In HTML, hyperlinks are made using the <a> tag, and on the website of the UK Supreme Court, the links to the original document have the title "Find out more on the BAILII website". We can use the following code to identify these links on the page:
text_link <- page |>
html_elements(xpath='//a[@title = "Find out more on the BAILII website"]')
However, there might be several links matching this description. It seems we are always interested in the first one, which makes our job easier:
# Keep only the first observed BAILII link
text_link <- text_link[grep("Judgment on BAILII", html_text(text_link))][1]
# Insert URL (href) into table:
cases[x, "text_url"] <- html_attr(text_link, "href")[1]
We now have a data set containing the facts, issues, important dates, and justices of the case, as well as a link to the full text of the decision.
Sometimes websites detect systematic patterns in usage, as well as excessive traffic, in order to ban suspicious (non-human) activity. Adding breaks to the code is therefore not only a kind gesture, but could be necessary for your script to run properly.
In order to make your traffic less likely to be blocked, using random breaks rather than a set interval can improve your chances further. The runif command, which generates a random number between a lower and an upper limit, is useful in this regard. In the code below I use it to generate a single number between 0.5 and 2: runif(1, .5, 2).
While it can often be useful not to be detected as a robot, do also make sure that you are complying with the terms and conditions of the website you are scraping! It is also important to be respectful of the server and not cause unnecessary strain on infrastructure. Excessive scraping can inflict significant costs on unprepared server operators, and nobody should be punished for offering a public good. As with everything else in life, be respectful.
Data on one single case is sadly not very useful, and we would want to run this code within a loop that compiles data on all the cases listed in the table. If we do this, we also need to take care not to put too much strain on the servers of the Court. We can do this by inserting a short break between each iteration of the loop.
The following code only needs to run if used inside a loop:
# Give some updates to the user so we know that the code is running well
message(x, "/", nrow(cases), " - ", cases$title[x])
# Rest for between 0.5 and 2 seconds, to give the server a little break.
Sys.sleep(runif(1, .5, 2))
# This is where the loop would end
# }
3. Data management
Running the above code as a loop, we have now created a data frame of 11 columns and, for 2023, 63 rows covering 52 different Supreme Court judgments. This is great, but the data could still be refined to make it easier to use: the dates are written in plain text, and the justices are listed in text strings separated by commas.
3.1. Converting dates
A good starting point is to convert dates into a machine-readable format: rather than "24 October 2023", the preferred date format would be "2023-10-24". Thankfully this is an easy task in R using the as.Date function and specifying the format as "%d %B %Y"; %d being the day of the month, %B the full name of the month, and %Y the four-digit year. As long as dates are presented following this format, the following code should work just fine.
# We need to set the language of R to English for the following lines to work
Sys.setlocale("LC_TIME", "C")
# if you're having problems here, try the following:
# Sys.setlocale("LC_TIME", "English")
# Sys.setlocale("LC_TIME", "en_UK")
cases$date_hearing_finish <-
as.Date(cases$date_hearing_finish, format = "%d %B %Y")
cases$date_hearing_start <-
as.Date(cases$date_hearing_start, format = "%d %B %Y")
cases$date_judgment <-
as.Date(cases$date_judgment, format = "%d %B %Y")
Having converted the dates from text strings to proper dates, we can start running calculations with them. For example, we can identify the length of the hearing and the length of the procedure from the beginning of the hearing to the publication of the judgment.
cases$length_of_hearing <-
(cases$date_hearing_finish - cases$date_hearing_start)+1
cases$length_of_procedure <-
cases$date_judgment - cases$date_hearing_start
Looking at the length of procedure, we find that procedures before the UK Supreme Court vary greatly in length; some last just over a month, while the longest procedure with a judgment published in 2023 lasted over three years. We also, however, find one very strange observation: one of the procedures appears to have lasted -624 days.
# Observe minimum value of procedure length
min(cases$length_of_procedure)
# > Time difference of -624 days
# Check the ID of the case with this procedure length
cases$id[which(cases$length_of_procedure == min(cases$length_of_procedure))]
# > UKSC 2021/0144
Checking this case on the website of the UK Supreme Court, we find that this is the result of a typo on the Court's website: rather than being listed as published on 14 June 2023, the Court lists it as having been published on 14 June 2021.
As we know all the judgments in the data set were published in 2023, we can add an automated solution to such problems by making sure the year of publication always matches the indexing of the Court. This is done in the following code, before the length of procedure is re-calculated:
# Find observations with the wrong year listed
bad_dates <- which(format(cases$date_judgment, "%Y") != "2023")
# Make sure you define the correct year!
# If looping through y, remove this line.
y <- 2023
# Replace the wrong year with the correct one
cases$date_judgment[bad_dates] <- cases$date_judgment[bad_dates] |>
str_replace("\\d{4}", as.character(y)) |>
as.Date()
# Re-calculate the length of the procedure
cases$length_of_procedure <-
cases$date_judgment - cases$date_hearing_start
In this case we were lucky: The mistake in the Court's data was not only easy to identify, but it could also easily be replaced with the correct information. Oftentimes we will not be as lucky, and no data will ever be perfectly reliable.
This emphasises the need for empirical researchers to keep a critical view of their data, to always scrutinise extreme observations, and to rigorously check for the impact of outliers when conducting statistical research.
3.2. Managing text data
While date variables are relatively straightforward, text data opens up near-infinite possibilities for management and manipulation. In this subsection I begin by taking a look at the list of justices, before going on to show examples of variables generated from the text of the summary of the case.
# Create "real" (machine readable) list of justices
cases$justices_list <- cases$justices |>
str_remove("\\s*$") |> # remove trailing spaces
str_split(", ")
Having the justices of each case presented as a proper list, we can easily observe the size of each panel. Furthermore, we can use the titles Lord/Sir and Lady/Dame to identify the gender of the justices.
# Count the number of justices
cases$justices_number <- sapply(cases$justices_list, length)
# Number of male and female justices
cases$justices_male <-
sapply(cases$justices_list, function(y)
length(grep("Lord|Sir", y)))
cases$justices_female <-
sapply(cases$justices_list, function(y)
length(grep("Lady|Dame", y)))
# Table of justices
table(unlist(cases$justices_list))
When looking at the table of justices, we find that Lord Lloyd-Jones is listed under two different names, as his name is written with two different types of dashes. I fix this before moving on to other text variables.
cases$justices_list <-
lapply(cases$justices_list, function(y) y |>
str_replace("–", "-"))
Another use of text data is to extract variables based on content. The summary of facts and the issue can be useful in this regard, as they present a rich summary of the case and cover its important details.
When working with text-based variables, we must always ask ourselves if we might encounter false positives or negatives. Maybe there are other keywords that should be included, or maybe a keyword we include can occur in a different context? Are we interested in migrating birds?
When collecting data for more years than just 2023, a large number of false negatives are generated by the simple fact that most observations lack a summary of the facts and the issue entirely. Using the code below, these will always be coded as FALSE. This can be resolved by setting these variables to NA when the text variables are missing, as sketched after the keyword examples below.
We might, for example, be interested in whether a case relates to migration, human rights, or the European Union. We can identify these cases using a keyword search, assuming that the presentation of the issue will mention certain keywords should it relate to that matter. I present an example of this below, identifying migration cases (migration, migrant, asyl), human rights cases (human rights, ECHR, ECtHR), and the European Union (EU, European Union).
# Migration cases
cases$migration <-
grepl("migration|migrant|asyl", cases$summary_issue, ignore.case = TRUE) |
grepl("migration|migrant|asyl", cases$summary_facts, ignore.case = TRUE)
# Human rights cases
cases$human_rights <-
grepl("human rights|ECHR|ECtHR", cases$summary_issue, ignore.case = TRUE) |
grepl("human rights|ECHR|ECtHR", cases$summary_facts, ignore.case = TRUE)
# Cases mentioning the EU
cases$european_union <-
grepl("EU|European Union", cases$summary_issue, ignore.case = FALSE) |
grepl("EU|European Union", cases$summary_facts, ignore.case = FALSE)
Of 62 cases with judgments published in 2023, nine related to the European Union, nine to migration, and 15 to human rights.
One might also gather descriptive data about the Court's summaries themselves, such as their lengths in terms of their number of paragraphs or characters.
# Length in paragraphs
cases$facts_n_paragraphs <-
sapply(cases$summary_facts, length)
cases$issue_n_paragraphs <-
sapply(cases$summary_issue, length)
# Length in number of characters
cases$facts_n_characters <-
sapply(cases$summary_facts, function(y) sum(nchar(y)))
cases$issue_n_characters <-
sapply(cases$summary_issue, function(y) sum(nchar(y)))
This uncovers a wide variation in the length of the Court's summaries. While the shortest summary of facts in 2023 consists of just one paragraph and 427 characters, the longest observations consist of six paragraphs or 2865 characters. The shortest presentation of the issue is only one paragraph of 104 characters, while the longest have five paragraphs or 1399 characters. Whether this variation can be given a substantive interpretation remains, however, an open question.
3.3. Aggregating data to judgment level
While we have so far constructed detailed data on the case level, most observers might be more interested in the judgments of the Court than in the cases before it. For this purpose we can aggregate data back up to the judgment level, feeding information back into the judgment data frame constructed earlier. The match command is particularly useful in this regard, using the "neutral citation" as the ID code across the two data frames.
# Dates
judgments$length_of_procedure <-
cases$length_of_procedure[match(judgments$`Neutral citation`,
cases$citation)]
judgments$length_of_hearing <-
cases$length_of_hearing[match(judgments$`Neutral citation`,
cases$citation)]
judgments$date_hearing_start <-
cases$date_hearing_start[match(judgments$`Neutral citation`,
cases$citation)]
judgments$date_judgment <-
cases$date_judgment[match(judgments$`Neutral citation`,
cases$citation)]
We can add a number of variables drawing from the data observed in the two data frames.
# number of joined cases
judgments$n_joined_cases <- sapply(judgments$ID_list, length)
# Related to the EU?
## 1. set all observations as NA
judgments$european_union <- NA
## 2. set all non-NA cases as FALSE
judgments$european_union[which(sapply(judgments$ID_list, function(y)
TRUE %in% (y %in% cases$id[which(!is.na(cases$european_union))])
))] <- FALSE
## 3. set judgments where any of the related cases are related to the EU as TRUE
judgments$european_union[which(sapply(judgments$ID_list, function(y)
TRUE %in% (y %in% cases$id[which(cases$european_union)])
))] <- TRUE
# Related to human rights?
judgments$human_rights <- NA
judgments$human_rights[which(sapply(judgments$ID_list, function(y)
TRUE %in% (y %in% cases$id[which(!is.na(cases$human_rights))])
))] <- FALSE
judgments$human_rights[which(sapply(judgments$ID_list, function(y)
TRUE %in% (y %in% cases$id[which(cases$human_rights)])
))] <- TRUE
# Length of issues
judgments$length_issue <-
sapply(judgments$`Neutral citation`, function(y)
mean(cases$issue_n_characters[which(cases$citation == y)])
)
# Length of facts
judgments$length_facts <-
sapply(judgments$`Neutral citation`, function(y)
mean(cases$facts_n_characters[which(cases$citation == y)])
)
# Number of justices
judgments$justices_number <-
cases$justices_number[match(judgments$`Neutral citation`,
cases$citation)]
# Binary variable for chamber larger than five justices
judgments$big_chamber <- judgments$justices_number > 5
# Binary variable observing whether a female justice was present
judgments$female_justice <-
judgments$`Neutral citation` %in%
cases$citation[which(cases$justices_female != 0)]
# Alternative formulation. Note that this code would need to be
# expanded if there was ever an all-female chamber.
judgments$gender_balance <-
ifelse(judgments$female_justice, "Mixed gender", "Male only")
# Categorical variable observing if the hearing is short (one day),
# medium (two days), or long (three or more days)
judgments$hearing_length <-
ifelse(judgments$length_of_hearing == 1, "short",
ifelse(judgments$length_of_hearing == 2, "medium",
"long"
))
# Binary variables for short and long hearings:
judgments$short_hearing <- judgments$hearing_length == "short"
judgments$long_hearing <- judgments$hearing_length == "long"
This produces a good starting point for studying the work of the UK Supreme Court. More work can of course be done to gather further information: not least by following the link to the full text of the judgment and getting one's hands dirty with text analysis of the full decisions of the Court.
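As a small illustration, a first step in that direction could look like the sketch below, which follows the BAILII link stored for a single case (the structure of the BAILII pages is an assumption here, and extracting clean judgment text would require more careful parsing):
# Follow the stored BAILII link for the first case that has one
first_url <- cases$text_url[which(!is.na(cases$text_url))[1]]
judgment_page <- read_html(first_url)
# A rough first pass: extract all paragraph text from the judgment page
judgment_text <- judgment_page |>
  html_elements(xpath = "//p") |>
  html_text()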
4. Analysing the data
Now that the data has been gathered, we can move on to the fun part - analysing it!
We might, for example, ask some critical questions about the gender balance at the Supreme Court. It's maybe not a big surprise that only 35% of the judgments published in 2023 had a female justice present in the chamber:
# Take a look at a table:
table(judgments$gender_balance)
# > Male only Mixed gender
# > 34 18
# Calculate the percentage of judgments with at least one female
# justice in the panel
sum(judgments$female_justice)/nrow(judgments)
# > 0.3461538
One might furthermore ask whether the distribution of female justices is random. I complement our "human rights" variable from earlier with a "finance" variable, searching for the keywords bank|financ|econom. This way we can test whether more female justices might be called to the bench to rule on human rights cases, while male dominance is even more prominent in the financial domain.
Let us first take a look at the presence of women in chambers related to questions of human rights:
ggplot(data = judgments,
aes(
x = gender_balance,
fill = human_rights)) +
geom_bar() +
labs(x = "Gender balance",
y = "Number of judgments",
fill = "Human rights") +
theme(legend.position = "top")
It appears that, relative to the standard of the UK Supreme Court, women are unusually well represented in judgments related to human rights: of the nine human rights-related judgments published in 2023, women were present in the chamber in five. That's more than half!
Keep in mind that these are still terrible numbers in terms of gender equality. Only one judgment published in 2023 ([2023] UKSC 36) saw two women serving together on the bench, making two the maximum number of women observed together in a chamber. Incidentally, this was a chamber of seven justices, meaning that the lowest number of male justices observed in a single chamber is four. It hardly needs mentioning that we did not see any chambers without any men in them. Oh well, moving on: what about the appointment of female justices to cases related to finance?
Here I depend on the finance variable I created outside the examples of this guide. Readers who want to reproduce the results will have to get their hands dirty creating this variable themselves - maybe even expanding on it by adding additional keywords of relevance.
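A minimal sketch of how such a variable could be constructed, mirroring the keyword search from section 3.2 and the judgment-level aggregation from section 3.3 (this is an illustration, not necessarily identical to the variable used in the graphs):
# Keyword search on the case level
cases$finance <-
  grepl("bank|financ|econom", cases$summary_issue, ignore.case = TRUE) |
  grepl("bank|financ|econom", cases$summary_facts, ignore.case = TRUE)
# Aggregate to the judgment level, as with the EU and human rights variables
judgments$finance <- NA
judgments$finance[which(sapply(judgments$ID_list, function(y)
  TRUE %in% (y %in% cases$id[which(!is.na(cases$finance))])
))] <- FALSE
judgments$finance[which(sapply(judgments$ID_list, function(y)
  TRUE %in% (y %in% cases$id[which(cases$finance)])
))] <- TRUE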
ggplot(data = judgments,
aes(
x = gender_balance,
fill = finance)) +
geom_bar() +
labs(x = "Gender balance",
y = "Number of judgments",
fill = "Finance") +
theme(legend.position = "top")
Out of ten judgments related to finance in 2023, women were represented in the chamber in only three, making female representation slightly worse than the Court's average for that year. It might seem the UK Supreme Court suffers not only from a dire shortage of female justices, but also from a set of embedded beliefs that women are better suited to working with human rights than with financial decision-making.
Such accusations would, of course, be unjustified without taking a more thorough look at the data: We would ideally want more than nine or ten observations to draw our conclusions from. Gathering data for more years than only 2023 is therefore necessary. To follow along in the code below the reader might attempt to expand on this guide to scrape data for a longer time period, or simply download the data as I provide it from my website. On the linked site you will also find a complete script to reproduce the full data set on the UK Supreme Court.
As the Court has only provided summaries of the facts and issues since 2020, this analysis remains limited in scope. We can, however, observe 32 judgments related to finance and 28 judgments related to human rights.
Looking at the presence of female justices in human rights judgments over time, we see no clear pattern. The same holds true for judgments relating to finance, with 2023 and 2021 being particularly bad years in terms of female representation in this area, but other years showing different patterns. In conclusion, it seems we would need more data in order to uncover a potential gender bias in the assignment of these types of cases.
What remains clear, however, is the persisting under-representation of women in the UK Supreme Court. As the panel of judges signing judgments is available all the way back to 2009, we can take a more thorough look at these developments over time.
After a brief positive trend in the late 2010s (in 2019 and 2020 the Court even published a total of four judgments with a majority-female panel), the Court has since reverted to earlier levels of male domination, with 48% of judgments published since 2009 being made by male-only panels.
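A sketch of how such a look over time could be plotted, assuming the judgments data frame has been expanded to cover all years since 2009 and contains the date_judgment and gender_balance variables created earlier:
# Extract the year of each judgment
judgments$year <- format(judgments$date_judgment, "%Y")
# Plot the share of male-only and mixed-gender panels per year
ggplot(data = judgments,
       aes(x = year, fill = gender_balance)) +
  geom_bar(position = "fill") +
  labs(x = "Year of judgment",
       y = "Share of judgments",
       fill = "Gender balance") +
  theme(legend.position = "top")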
A takeaway, beyond the lack of improvement in gender balance at the UKSC over the last 15 years, is that researchers need to exercise caution when conducting empirical research, especially on limited samples. Descriptive statistics can prove extremely insightful, but it is also clear that 2023 alone is not a representative year for studying the presence of women at the UKSC, and when studying a longer time period there is no sign that women are under-represented in finance cases vis-à-vis human rights ones.
If you are interested in further information about the graphical analysis of data, I cover this in some further detail in my guide to studying the Court of Justice of the European Union in R.
5. Conclusion
While limited in scope, the techniques presented in this guide provide a starting point for the automated collection of data from a wide variety of online sources. If you are seeking to conduct your own data gathering on judicial institutions and face challenges along the way, feel free to contact me; I would be happy to try to help and potentially improve the guide for the benefit of future readers.
Data for the entire period from when the Court started publishing its cases online in 2009 until the data was last updated in June 2024 is available here. I have also made available the script used to compile these data, based on the code presented on this page.
I conclude by repeating my encouragement to anyone who collects data on courts to strive to make it publicly available, for the enjoyment of anyone interested in conducting empirical research into the work of courts. National legal systems don't exist in a vacuum, and collecting data on national courts throughout the world will allow for insights in empirical legal studies that no single researcher or even research project could achieve alone.