Studying the Court of Justice of the European Union in R

This is an introductory guide to studying the Court of Justice of the European Union (CJEU) in R, using data from the IUROPA CJEU Database.

It begins with a step-by-step guide to loading data from the database into R, then offers some advice on how to manage the data, and concludes with some examples of how it can be visualised.

This is not intended to be a comprehensive guide for learning R, but rather a beginner's guide to studying the CJEU using the software. One resource that cannot be recommended enough for those wanting to learn R in more depth is the brilliant and free book R for Data Science by Hadley Wickham, Mine Çetinkaya-Rundel, and Garrett Grolemund.

Table of contents:

  1. Accessing the data
    1.1. Installing the IUROPA package
    1.2. Data tables
    1.3. Downloading data
  2. Data management
    2.1. Combining data across data sets
    2.2. Dealing with lists
  3. Analysing data
    3.1. Length of procedures over time
    3.2. Number of judgments in a policy area over time
    3.3. Age of appointed judges
  4. Conclusion

1. Accessing the data

In this first part I describe how to access data from the IUROPA project in R, give a general overview of what the data contains, and show how to download it.

1.1. Installing the IUROPA package

The easiest way of accessing data from the IUROPA project in R is by installing the dedicated R package developed by Joshua Fjelstul.

As the package is hosted on GitHub, a neat way to install it is through the devtools package. If you do not have this package installed already, you can install it by running the following line:

install.packages("devtools")

As soon as the devtools package is installed you can install the IUROPA package directly from GitHub:

devtools::install_github("jfjelstul/iuropa")

Once installed, the package can be loaded as follows:

library(iuropa)
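
If you re-run a script often, it can be convenient to check which packages are missing before installing anything. The following is a minimal sketch; the helper name is_installed is hypothetical and not part of the iuropa package, and the install commands (the same as above) are left as comments so that nothing is installed by accident.

```r
# Hypothetical helper: TRUE if the package can be loaded, without attaching it
is_installed <- function(pkg) requireNamespace(pkg, quietly = TRUE)

# Which of the two packages used in this section are still missing?
missing_pkgs <- Filter(Negate(is_installed), c("devtools", "iuropa"))
print(missing_pkgs)

# Install only what is missing:
# if ("devtools" %in% missing_pkgs) install.packages("devtools")
# if ("iuropa" %in% missing_pkgs) devtools::install_github("jfjelstul/iuropa")
```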

1.2. Data tables

You are now ready to use the IUROPA package to load data on the CJEU directly into R. There are a total of 17 different data tables available for download, all of which can be described in further detail by running the following:

description <- describe_tables(component="cjeu_database_platform")
View(description)

You can find a comprehensive overview of the tables included in the database and the variables within them in the code book. The most important tables are briefly described below, along with a selection of variables found within them.

The cases table contains all cases lodged before the CJEU.

cjeu_case_id
The ID code of the case before the Court, assigned to a case at the moment it enters the docket. This is the ID code used by the Court. In older cases official sources will often list case numbers with digits only (for example 6/64); in our database, we always include the prefix indicating the tribunal (C-6/64).
is_joined
Binary variable: 1 if the case has been joined into another case, 0 if it has not. If is_joined is 1, decisions published in the case will be published under the case number listed under the variable joined_to.
joined_to
If the case has been joined, this lists the first case of the cases it was joined with.
case_name
The name of the case, for example Costa v E.N.E.L.
court
The court in which the procedure took place: The Court of Justice, General Court, or Civil Service Tribunal.
case_year
The year in which the case was initiated.
is_pending
Binary variable observing whether a case is completed or if it is still pending before the CJEU (as of the last update of the database).
is_removed
Binary variable observing whether a case is removed from the register.
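
As a small illustration of how is_joined and joined_to work together, the sketch below derives the case number under which decisions are published. The data frame is made up for illustration (the case IDs are hypothetical), and it assumes that joined_to is missing (NA) for cases that have not been joined, which may differ from how the database encodes it.

```r
# Toy data with hypothetical case IDs; not taken from the database
cases_demo <- data.frame(
  cjeu_case_id = c("C-1/00", "C-2/00", "C-3/00"),
  is_joined    = c(0, 1, 0),
  joined_to    = c(NA, "C-1/00", NA), # assumption: NA when not joined
  stringsAsFactors = FALSE
)

# Decisions in a joined case are published under the case it was joined to
cases_demo$published_under <- ifelse(
  cases_demo$is_joined == 1,
  cases_demo$joined_to,
  cases_demo$cjeu_case_id
)

cases_demo
```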

The judgments table includes data on the judgments of the Court, excluding other types of decisions such as AG opinions and orders of the Court.

cjeu_case_id
The ID code of the case before the Court. If multiple cases are joined, this variable lists the first of the joined cases.
ecli
ECLI numbers are the official ID codes given by the CJEU to its decisions.
celex
CELEX numbers are the official ID codes given to European legislation by EUR-lex. Most, but not all, judgments are given unique CELEX numbers, and they are sometimes used as ID codes in databases on the work of the CJEU.
decision_date
The date of the judgment. The date will be returned in text format; use the as.Date function to convert it into a date.
proceeding_date
The date of the lodging of the first case of the procedures leading up to the judgment. The date will be returned in text format; use the as.Date function to convert it into a date.
duration_days
The duration of the procedure leading up to the decision, in days. Corresponds to decision_date minus proceeding_date.
court
The court in which the procedure took place: The Court of Justice, General Court, or Civil Service Tribunal.
list_authentic_languages
The authentic languages of the judgments, which generally correspond to the languages of the parties before the CJEU.
list_procedures
The type of procedure before the court, along with a list of outcomes when applicable. References for preliminary rulings have no listed outcomes, as the Court does not decide the case as such but merely provides an answer to the question referred by the national court.
is_preliminary_ruling
Binary variable observing whether the judgment is made in a preliminary ruling procedure.
is_direct_action
Binary variable observing whether the judgment is made in a direct action.
is_appeal
Binary variable observing whether the judgment is made in an appeal procedure.
list_referring_member_states
A list of the member states of the referring court(s). There is usually only one referring member state, but there could be several in joined procedures. Values are separated by commas.
count_observers
The number of observations submitted to the CJEU as part of the procedure.
list_observers
The names of the parties making submissions to the CJEU in the procedure.
iuropa_judge_rapporteur_id
The ID code of the judge serving as judge rapporteur in the judgment, corresponding to iuropa_judge_id in the judges data set.
judge_rapporteur
The surname of the judge serving as judge rapporteur in the judgment.
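
The date variables above are returned as text, so they need converting before any date arithmetic. The sketch below uses made-up dates and assumes they are formatted as YYYY-MM-DD (the default format read by as.Date); it then recomputes the duration in days as decision_date minus proceeding_date.

```r
# Toy data with invented dates, assumed to be in YYYY-MM-DD text format
dates_demo <- data.frame(
  proceeding_date = c("2015-01-10", "2016-03-01"),
  decision_date   = c("2016-01-10", "2016-03-31"),
  stringsAsFactors = FALSE
)

# Convert text to Date
dates_demo$proceeding_date <- as.Date(dates_demo$proceeding_date)
dates_demo$decision_date   <- as.Date(dates_demo$decision_date)

# Subtracting two dates gives a difference in days; as.numeric drops the unit
dates_demo$duration_check <- as.numeric(
  dates_demo$decision_date - dates_demo$proceeding_date
)

dates_demo$duration_check
```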

The decisions table includes not only judgments, but also other published documents such as orders of the Court, orders of the President of the Court, and opinions of the advocates general (AG opinions). Each decision is identified by a unique "European Case Law Identifier", or ECLI for short. For example, Costa v ENEL saw the publication of three decisions: an order of the Court (ECLI:EU:C:1964:51), an AG opinion (ECLI:EU:C:1964:51), and a judgment (ECLI:EU:C:1964:66).

cjeu_case_id
The ID code of the case before the Court. If multiple cases are joined, this variable lists the first of the joined cases.
ecli
ECLI numbers are the official ID codes given by the CJEU to its decisions.
celex
CELEX numbers are the official ID codes given to European legislation by EUR-lex. The most important decisions of the CJEU are given unique CELEX numbers, and they are sometimes used as ID codes in databases on the work of the CJEU.
decision_type
The type of decision, such as AG opinion, order, or judgment.
decision_date
The date of the decision. The date will be returned in text format; use the as.Date function to convert it into a date.
court
The court in which the procedure took place: The Court of Justice, General Court, or Civil Service Tribunal.

The judges table includes information about the judges, advocates general, and registrars of the CJEU.

iuropa_judge_id
A unique ID code for each judge of the CJEU, created for the purpose of the IUROPA database. Historically some judges have the same name and the names of the judges might be spelled in a variety of ways, making the judge ID a useful variable for working with the judges, registrars, and advocates general of the CJEU.
full_name
The full name of the individual.
last_name
The surname of the individual.
last_name_latin
The surname of the individual, with accents and special characters removed.
start_date
The date the person first started working for the CJEU.
end_date
The date the person last worked for the CJEU.
member_state
The appointing member state.
birth_year
The year of birth of the individual.
is_female
Binary variable: 1 if the judge is female, 0 if not.
was_judge
Binary variable observing whether the individual worked as a judge prior to beginning at the CJEU.
was_academic
Binary variable observing whether the individual worked in an academic position prior to beginning at the CJEU.
was_civil_servant
Binary variable observing whether the individual worked as a civil servant prior to beginning at the CJEU.
was_lawyer
Binary variable observing whether the individual worked as a lawyer prior to beginning at the CJEU.
was_politician
Binary variable observing whether the individual worked as a politician prior to beginning at the CJEU.

As judges might have been working in different positions and for more than one of the tribunals making up the CJEU, more than one start and end date, as well as positions, might apply for a single individual. For more detailed information about the appointments to positions in the different tribunals of the CJEU, see the appointments data set.

The appointments data set observes the appointment of judges to various positions before the CJEU. Multiple observations can therefore be assigned to a single judge. The data set does not currently observe the reappointment of sitting judges.

iuropa_judge_id
The ID code of the individual judge.
start_date
The date the person first started working for the CJEU in the recorded position.
end_date
The date the person last worked for the CJEU in the recorded position.
position
The position of the judge at the CJEU: judge, registrar, Advocate General, President, or Vice-President.
duration_days
The duration of the term served by the individual, in days.

The assignments data set observes which judges sign the decisions of the CJEU. Each row represents one judge assigned to a decision. Generally, the Court is composed of one of three chamber compositions: A small chamber of three judges, a medium chamber of five or seven judges, or a grand chamber of 15 judges or more.
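
Because each row is one judge assigned to one decision, counting rows per ECLI gives the number of judges who signed each decision. The sketch below uses invented identifiers (the ECLI and judge ID values are placeholders, not real codes) to show the idea.

```r
# Toy data: placeholder identifiers, one row per judge per decision
assignments_demo <- data.frame(
  ecli            = c(rep("decision-A", 3), rep("decision-B", 5)),
  iuropa_judge_id = c("J1", "J2", "J3", "J1", "J4", "J5", "J6", "J7"),
  stringsAsFactors = FALSE
)

# Number of judges per decision
chamber_size <- table(assignments_demo$ecli)
chamber_size
```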

cjeu_case_id
The ID code of the case before the Court. If multiple cases are joined, this variable lists the first of the joined cases.
ecli
ECLI numbers are the official ID codes given by the CJEU to its decisions.
iuropa_judge_id
The ID code of the given judge.
judge
The name of the judge.
is_judge_rapporteur
Binary variable observing whether the judge served as judge rapporteur.

The procedures data set observes the procedures before the Court. The most common procedure types are references for preliminary rulings, actions for failure to fulfil obligations, actions for annulment, actions for failure to act, damages for non-contractual liability, appeals, and staff cases.

cjeu_case_id
The ID code of the case before the Court. If multiple cases are joined, this variable lists the first of the joined cases.
ecli
ECLI numbers are the official ID codes given by the CJEU to its decisions.
procedure
The type of procedure. While there are a number of different procedures, the most common ones are, in decreasing order: reference for a preliminary ruling, action for annulment, staff case, action for failure to fulfill obligations (also known as infringement procedures), appeal, action for damages, and appeal against a penalty.
is_successful
Binary variable observing whether the procedure is successful. Always 0 for preliminary references, as this is not a relevant question for these procedures.
is_unfounded
Binary variable observing whether the Court declares a case to be unfounded.
is_inadmissible
Binary variable observing whether the Court declares a case to be inadmissible.

The parties table observes the parties to cases before the CJEU. Each row represents a party to a case, along with the standardised type of actor, such as member state or EU institution.

cjeu_case_id
The ID code of the case before the Court.
party_role
The role of the party before the CJEU: applicant, defendant, or litigant.
party
The name of the party.
party_type
The type of party, such as legal person, EU institution, and EU member state.

The citations table includes the references included in a decision of the Court.

cjeu_case_id
The ID code of the case before the Court.
ecli
The ECLI number of the decision in which the citations are found.
cited_celex
The CELEX number of the cited legislation.
cited_type
The type of decision cited, such as treaty, legislation, or case law.
cited_subtype
Offers more fine-grained type of citation types, such as regulation or judgment (Court of Justice).
cited_detail
Observes information such as the cited article and paragraph.

The observers table lists observations made by EU member states, EU institutions, EU agencies, EFTA member states, and EFTA institutions to preliminary reference procedures before the CJEU.

cjeu_case_id
The ID code of the case before the Court.
ecli
The ECLI number of the decision in which the observations are found.
observer
The name of the party submitting the observation, for example European Commission or Italy.
observer_type
The type of the party submitting the observation, for example EU institution or EU member state.

1.3. Downloading data

Data can be downloaded using the download_data function of the iuropa package. Downloading the entire judges data set can be done as follows:

judges <- download_data(
  component="cjeu_database_platform", 
  table = "judges"
)

While feasible for a smaller data set such as the one listing the judges of the CJEU, downloading the entire data set all at once is not particularly efficient for larger data sets. A good idea can therefore be to specify some filters for which data to download.

In order to specify such filters, you must first decide which data you are interested in. One option is to consult the code book; another is to use the describe_variables function to get familiar with the contents of a given table. For example, the following code downloads and views a description of all the variables in the decisions data set:

description <- describe_variables(
  component = "cjeu_database_platform", 
  table = "decisions"
)

View(description)

From this description we find that the decisions data set contains 19 different variables, as well as what kind of values these variables contain. The variable court may for example contain one of three different values: "Court of Justice", "General Court", or "Civil Service Tribunal". In contrast, other variables may contain a wide variety of values, such as cjeu_case_id which lists the official ID code of the procedure a case is lodged in.

We can use this information to download a specific subset of the data. For example, we might be interested in downloading all decisions published in Costa v ENEL, which can be done by specifying the case ID of Costa: C-6/64. The following code loads the three decisions of the Court in Costa v ENEL into an object called costa.

costa <- download_data(
  component = "cjeu_database_platform", 
  table = "decisions",
  filters = list(
    cjeu_case_id = "C-6/64"
  )
)

Instead of specifying a single case, a vector can be used to define several potential values. For example, replacing "C-6/64" with c("C-26/62", "C-6/64") will download a set of all decisions in both Costa v ENEL (C-6/64) and Van Gend en Loos (C-26/62).
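
The same idea of matching against a vector of values also applies to data that is already loaded in R. The sketch below is not part of the iuropa package: it uses a small made-up data frame (the decision types and the third case ID are placeholders) and the base R operator %in% to keep rows whose case ID matches any element of a vector.

```r
# Toy data standing in for an already downloaded decisions table
decisions_demo <- data.frame(
  cjeu_case_id  = c("C-26/62", "C-6/64", "C-6/64", "C-0/00"), # last ID is a placeholder
  decision_type = c("judgment", "AG opinion", "judgment", "judgment"),
  stringsAsFactors = FALSE
)

# Keep rows where the case ID is any of the listed values
wanted <- c("C-26/62", "C-6/64")
subset_demo <- decisions_demo[decisions_demo$cjeu_case_id %in% wanted, ]

nrow(subset_demo)
```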

As additional constraints, we may also define a list of variables to download, along with a filter containing a number of specifications. Let's say we are interested in downloading all judgments of the Court of Justice in cases lodged between 2010 and 2019. Furthermore, we are only interested in the case ID, the ECLI number, whether the procedure is a preliminary ruling, the judge rapporteur, the subject matter, and the date of the judgments. We can specify this data selection as follows, downloading the data into an object named judgments:

judgments <- download_data(
  component = "cjeu_database_platform", 
  table = "judgments",
  filters = list(
    case_year = 2010:2019, # A vector: c(2010, 2011, ... 2019)
    court = "Court of Justice"
  ),
  variables = c(
    "cjeu_case_id", 
    "ecli", 
    "decision_date",
    "list_procedures", # List of procedure types
    "is_preliminary_ruling", # If preliminary ruling
    "list_subject_keywords", # Subject matters
    "judge_rapporteur", # Name of the judge rapporteur
    "iuropa_judge_rapporteur_id" # ID code of rapporteur
  )
)

A piece of advice: always include ID codes observed at the unit of analysis you are downloading the data for. As this data is observed at the level of decisions, ECLI numbers provide a unique ID code for all observations. This will be useful if you need to include more variables later on, or to check which decision an observation represents.
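
A quick way to confirm that an ID variable really is unique at your unit of analysis is base R's anyDuplicated, which returns 0 when no value is repeated. The sketch below uses placeholder identifiers rather than real ECLI numbers.

```r
# Toy data with placeholder identifiers
ids_demo <- data.frame(
  ecli         = c("id-001", "id-002", "id-003"),
  cjeu_case_id = c("case-A", "case-A", "case-B"),
  stringsAsFactors = FALSE
)

# TRUE: every row has its own decision ID
ecli_is_unique <- anyDuplicated(ids_demo$ecli) == 0

# FALSE: one case can have several decisions
case_id_is_unique <- anyDuplicated(ids_demo$cjeu_case_id) == 0
```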


2. Data management

Downloading the data is, of course, only the beginning. Organizing and making sense of the data is a major task, and one that will require a decent amount of work. Here I will present two things to get you started: moving variables from one data set to another by matching ID codes, and working with list variables using the apply family of functions in R.

2.1. Combining data across data sets

Let's say we are interested in the gender of the judge rapporteur drafting the judgments of the Court. This poses a challenge, as the gender of the judges of the CJEU is known from the judges data set, but this information is not included directly in the judgments set. In order to get the gender of the judge rapporteur, we therefore have to download this from the judges data set.

We start out with the judgments data set from section 1.3. First, we observe all unique ID codes of the judge rapporteurs (iuropa_judge_rapporteur_id), which correspond to the iuropa_judge_id variable in the judges data set. We store all the observed ID codes in an object called rapporteurs.

We then download the ID codes and gender (is_female) from the judges data set, using the rapporteurs object to keep only the relevant observations. Last, we use the match function to add the is_female variable (renamed as rapporteur_is_female) to judgments.

The match function returns a vector of positions: for each value in the first vector, it gives the position of the first matching value in the second vector, or NA if there is none. In this case, the two vectors are ID codes observed in two different data tables. If you for example want to insert variable var from a table called data2 into another table called data1, with both tables containing the ID code id, it could be done using the following: data1$var <- data2$var[match(data1$id, data2$id)]. For merging two data tables you might also be interested in checking out the merge function.
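
To make this concrete, here is a small sketch with two made-up tables, data1 and data2, following the generic names used above. It shows the positions returned by match, the resulting variable, and the same lookup done with merge (with all.x = TRUE so that unmatched rows of data1 are kept).

```r
# Toy tables; the values are invented for illustration
data1 <- data.frame(id = c("a", "b", "c", "b"), stringsAsFactors = FALSE)
data2 <- data.frame(id = c("b", "a"), var = c(10, 20), stringsAsFactors = FALSE)

# Position in data2$id of each value in data1$id (NA where there is no match)
pos <- match(data1$id, data2$id)

# Use the positions to look up var
data1$var <- data2$var[pos]

# Alternative: merge, keeping all rows of the left-hand table
# (note that merge sorts the result by the key)
merged <- merge(data1["id"], data2, by = "id", all.x = TRUE)
```

Unlike match, merge reorders the rows by the ID variable, which is worth keeping in mind if row order matters later on.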

# Gather ID codes of all judge rapporteurs observed in the data set
rapporteurs <- 
  unique(judgments$iuropa_judge_rapporteur_id)

# Download data from judges data set
judges <- download_data(
  component="cjeu_database_platform", 
  table="judges",
  filters = list(
    iuropa_judge_id = rapporteurs
  ),
  variables = c("iuropa_judge_id", "is_female")
)

# Insert variable rapporteur_is_female into judgments data frame  
judgments$rapporteur_is_female <-
  judges$is_female[match(
    judgments$iuropa_judge_rapporteur_id,
    judges$iuropa_judge_id)]

Now that we have the gender of the judge rapporteurs in the judgments of cases lodged in the 2010s, we can observe the percentage of women rapporteurs in the judgments of the Court:

# Get percentage distribution of male and female judge rapporteurs
table(judgments$rapporteur_is_female)/
  nrow(judgments)*100

# Get percentage of male and female judge rapporteurs observed in the period
table(judges$is_female)/
  nrow(judges)*100

We find that 19.4% of the judgments in cases lodged in the 2010s had female judge rapporteurs. Furthermore, we observe that 18.6% of the judge rapporteurs active in the period were female. This suggests that the under-representation of women in the rapporteur role reflects a lack of women on the bench, and that the women who do serve as judge rapporteurs are no less active than their male counterparts.
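
One thing to be aware of: table leaves out missing values by default, whereas nrow counts every row, so the percentages above will not sum to 100 if some values are missing. A possible alternative, sketched here on an invented vector, is prop.table, which computes shares of the tabulated total.

```r
# Toy vector with one missing value
is_female_demo <- c(0, 1, 0, 0, NA)

# Shares among non-missing observations (sums to 100)
share_known <- prop.table(table(is_female_demo)) * 100

# Shares with missing values shown as their own category
share_all <- prop.table(table(is_female_demo, useNA = "ifany")) * 100

round(share_known, 1)
round(share_all, 1)
```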

2.2. Dealing with lists

In some variables, more than one value may be observed for the same observation. When this is the case they are presented as lists in text format, separated by commas (for most variables) or semicolons (for subject matters, which might themselves contain commas). The easiest way to filter these variables is to conduct a simple text search. This is done below using the grep function, listing all ECLI numbers where at least one subject matter contains the search term "tax".

# Judgments with subject matter mentioning "tax"
tax_judgments <- judgments$ecli[
  grep("tax", 
       judgments$list_subject_keywords, 
       ignore.case = TRUE)
]

The grep function allows for the use of regular expressions, providing powerful search functionality. For example, a pipe symbol (|) means "OR": a search for "tax|trade" will return all observations where either tax or trade is mentioned.
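
A minimal illustration of the "OR" search, using a few invented subject-matter strings rather than values from the database:

```r
# Hypothetical subject-matter strings
keywords_demo <- c(
  "Taxation; Environment",
  "External trade",
  "Agriculture",
  "Value added tax"
)

# TRUE where either "tax" or "trade" occurs, ignoring case
tax_or_trade <- grepl("tax|trade", keywords_demo, ignore.case = TRUE)
tax_or_trade
# → TRUE TRUE FALSE TRUE
```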

Searching for procedure types in this way can be useful for identifying specific types of procedure using the list_procedures variable in judgments:

# Create binary variable indicating if a judgment is in an action for failure to fulfil obligations.
# The pattern stops at "fulfil" so that it matches both the "fulfil" and "fulfill" spellings.
judgments$infringement <- 
  grepl("action for failure to fulfil",
        judgments$list_procedures)

Though useful in many cases, doing such a broad search in list variables does not always provide the best outcome. For example, you might want to find decisions where the subject matter is "Taxation" specifically, rather than merely mentioning taxation. The stringr package provides useful tools for handling lists. Below, str_split is used to create a list variable called subject_matter.

# load package stringr. 
# Install with install.packages("stringr") if missing.
library(stringr)

# Turn list of subject matters into real list
judgments$subject_matter <- 
  str_split(judgments$list_subject_keywords, "; ") # Replace with comma if comma separated list

Having the variable as a proper list enables a number of possibilities. Let's first create an object subject_matters, listing all the unique subject matters observed in the data, and then use grep to identify subject matters related to taxation.

The unique function returns a vector of all unique values in an object; for example, unique(c("cat", "dog", "dog", "rabbit", "dog")) returns c("cat", "dog", "rabbit").

# Observe unique subject matters:
subject_matters <- unique(unlist(judgments$subject_matter))

# Return subject matters related to tax
subject_matters[grep("tax", 
                     subject_matters, 
                     ignore.case = TRUE)]

We observe four different tax related policy areas: "Taxation", "Value added tax", "Indirect taxation", and "Internal taxation". Some policy areas are, however, more common than others; by using the table function we can observe the number of observations in each one, and by filtering by the number of observations we can identify the most central subject matters observed in our judgments data set.

# Create table of subject matters 
subject_matter_table <- table(unlist(judgments$subject_matter))

# Identify the most common subject matters
# Keep subject matters with more than 300 observations and sort by number of observations in decreasing order.
sort(subject_matter_table[which(subject_matter_table > 300)], 
     decreasing = TRUE)

This shows that "Taxation" is the second most common subject matter, with 560 judgments in the observed period.

As "Taxation" is the only subject matter spelling taxation with an upper case T, a case-sensitive text search using grepl would work just fine for this data set to identify all observations belonging to this subject matter without false positives. This might, however, not always be the case: For example, searching for "Euro" would return every subject matter mentioning the word "European" or "Europe", not just those relating to the currency.
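
The problem is easy to reproduce with a few invented subject-matter labels. The sketch below also shows one regex-based workaround, a word boundary (\\b), which is offered here as an alternative and is not the approach used in the rest of this guide.

```r
# Hypothetical subject-matter labels
subjects_demo <- c("Euro", "European citizenship", "Economic policy")

# Plain search: also matches "European"
plain <- grepl("Euro", subjects_demo)

# Word boundaries: "Euro" must stand as a word of its own
bounded <- grepl("\\bEuro\\b", subjects_demo, perl = TRUE)

plain    # TRUE TRUE FALSE
bounded  # TRUE FALSE FALSE
```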

One solution to this problem is found in the apply family of functions. The following code adds a binary variable taxation to the judgments data set, observing whether Taxation is one of the policy areas of the judgment.

# Binary variable observing whether "Taxation" is a subject matter 
judgments$taxation <- sapply(
  judgments$subject_matter, function(y) 
    "Taxation" %in% y
)

This is a powerful starting point for working with lists. In this example, "Taxation" %in% y can be replaced with any command producing a TRUE or FALSE outcome, where y represents the contents of each individual row of subject_matter. If you want to test a command before including it in the apply function you can assign individual observations of judgments$subject_matter to an object y, and run it individually:

y <- judgments$subject_matter[[1]] # [[ ]] extracts the character vector itself, not a one-element list
"Taxation" %in% y

This returns FALSE, which is correct as "Taxation" is not one of the subject matters of the first row of the judgments data set.
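
As a side note, vapply is a stricter relative of sapply: you state the expected type and length of each result, and R raises an error if a result does not fit. The sketch below applies it to an invented list of subject matters; it is an optional variation, not a replacement for the code above.

```r
# Toy list variable; the third observation has no subject matters
subject_matter_demo <- list(
  c("Taxation", "Environment"),
  "Agriculture",
  character(0)
)

# Each call must return exactly one logical value
taxation_demo <- vapply(
  subject_matter_demo,
  function(y) "Taxation" %in% y,
  logical(1)
)

taxation_demo
# → TRUE FALSE FALSE
```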

As within regular expressions, a pipe symbol can be used to define multiple alternative criteria, reading as "OR": TRUE | FALSE returns TRUE, as one of the two arguments is TRUE. In contrast, & can be used to mean "AND": TRUE & FALSE will return FALSE, whereas TRUE & (TRUE | FALSE) returns TRUE. Finally, ! can be used to mean "NOT": !TRUE equals FALSE and vice versa.

The which function is used to observe which elements of a logical vector are TRUE: for example, which(c(TRUE, TRUE, FALSE, FALSE, TRUE)) will return c(1, 2, 5). It can be used to identify the rows of a data set in which a given criterion is met.
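
Before applying these operators to the real data, it can help to try them on two short invented logical vectors:

```r
# Toy logical vectors, one value per (imaginary) judgment
has_tax <- c(TRUE, TRUE, FALSE, FALSE, TRUE)
has_env <- c(FALSE, TRUE, FALSE, TRUE, TRUE)

both       <- which(has_tax & has_env)   # AND
either     <- which(has_tax | has_env)   # OR
tax_no_env <- which(has_tax & !has_env)  # AND NOT

both        # 2 5
either      # 1 2 4 5
tax_no_env  # 1
```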

Below, this logic is illustrated with a series of examples:

# Judgments related to indirect taxation OR internal taxation:
indirect_or_internal <- which(
  sapply(
    judgments$subject_matter, function(y) 
      "Indirect taxation" %in% y | "Internal taxation" %in% y
  )
)

# Both Taxation AND Environment:
environment_tax <- which(
  sapply(
    judgments$subject_matter, function(y) 
      "Environment" %in% y & "Taxation" %in% y
  )
)

# Taxation AND NOT Value added tax:
tax_not_vat <- which(
  sapply(
    judgments$subject_matter, function(y) 
      "Taxation" %in% y & !"Value added tax" %in% y
  )
)

# Return number of judgments related to both taxation and environment:
length(environment_tax)

# Return the ECLI numbers of these cases:
judgments$ecli[environment_tax]

# Create subset of data containing only judgments related to taxation and environment
judgments_env_tax <- judgments[environment_tax,]

If you are seeking to create a list variable where, like in subject_matter, there can be multiple values for each observation, this can be achieved by using the lapply function instead of sapply. We might, for example, be interested in creating a list variable observing the outcome of procedures in the list_procedures variable. There might be several of these outcomes listed in the same judgment, and they are listed within parentheses after the procedure type, separated by commas.

A possible value is for example:

action for failure to fulfil obligations (successful, unfounded, inadmissible)

In order to create a list of outcomes we need to extract the text within the parenthesis, and split the list by comma. One way of doing so is as follows, taking advantage of pipes to perform actions in multiple steps within the lapply function.

judgments$outcome <- lapply(
  judgments$list_procedures, function(y)
    y |> 
      str_extract("\\(.*\\)") |> # Extracts text within parentheses
      str_remove_all("\\(|\\)") |> # Removes ( and )
      str_split(", ") |> # Splits the list at commas
      unlist() # Flattens said list to remove complexity
)

In the above example, the use of lapply and pipes is combined with three functions from the stringr package. First, str_extract is used to extract text from a string. In this case, a regular expression is used to extract everything between the first opening and the last closing parenthesis observed: in \\(.*\\), . is used to indicate any character, whereas * marks any number of repetitions of it. As parentheses carry a specific meaning in regular expressions, they are escaped with a double backslash (\\) in order to be interpreted as characters rather than within the logic of regular expressions.

If faced with several occurrences of parenthesised text within the same text string, one could consider instead using str_extract_all in combination with .*?, which would extract all parentheses separately and present them as a list. The question mark indicates that rather than observing as many occurrences of any character as possible, the expression is to observe as few as possible, in this case ending as soon as the first end of a parenthesis is observed.
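
The difference between the greedy and the lazy pattern can be seen on a single invented string. The value below is hypothetical and only assumes that two procedures, each followed by parenthesised outcomes, appear in the same text.

```r
library(stringr)

# Hypothetical value with two parenthesised groups
procedures_demo <-
  "action for annulment (successful), action for damages (unfounded, inadmissible)"

# Greedy: one match, from the first "(" to the last ")"
greedy <- str_extract(procedures_demo, "\\(.*\\)")

# Lazy: each pair of parentheses is matched separately
lazy <- str_extract_all(procedures_demo, "\\(.*?\\)")[[1]]

# Strip the parentheses and split each group at commas
outcomes_demo <- str_split(str_remove_all(lazy, "\\(|\\)"), ", ")

greedy
lazy
outcomes_demo
```

Here outcomes_demo is a list with one element per procedure, so the outcomes stay attached to the procedure they belong to.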

After extracting the parentheses, str_remove_all is used to remove them, leaving only the content within them. Again, the parentheses are escaped with a double backslash, and | is used within the regular expression to mean "or". Finally, str_split is used to turn the comma-separated list into a proper list of outcomes, before unlist turns the list into a vector in order to remove unnecessary complexity.


3. Analysing data

In this section I present some examples of how the data can be analysed, focusing on graphical representations of the data. I use the package ggplot2, which must be installed if it has not been used before (install.packages("ggplot2")).

3.1. Length of procedures over time

Let's say we are interested in the average length of preliminary reference procedures over time. In order to get to this, we would first need to download a data set of the judgments of the Court in preliminary reference procedures containing the length of its procedures:

# Download data set of preliminary rulings
judgments <- download_data(
  component = "cjeu_database_platform", 
  table = "judgments",
  filters = list(
    is_preliminary_ruling = 1
  ),
  variables = c(
    "cjeu_case_id", 
    "ecli", 
    "decision_date",
    "duration_days",
    "list_subject_keywords",
    "is_urgent_procedure"
  )
)

In ggplot, the variables defining the aesthetics of the graph are defined within the aes function: in this first example, we use it to define what we want to be plotted on the X and Y axes. With these aesthetics defined, geom_smooth is called upon to draw a smoothed line through the average values over time.

In order to ensure that decision_date is interpreted as a date rather than as a set of characters, it is wrapped within the as.Date function.

library(ggplot2)

ggplot(judgments,
    aes(
        x = as.Date(decision_date),
        y = duration_days
    )) +
  geom_smooth()

This produces a very simple plot showing the average length of preliminary reference procedures before the Court of Justice over time.
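
If you would rather inspect the numbers behind such a curve, a simple complement is to compute the mean duration per decision year with base R's aggregate. This is a sketch on invented values, not output from the database, and a yearly mean is a cruder summary than the smoothed line drawn by geom_smooth.

```r
# Toy data with invented dates and durations
duration_demo <- data.frame(
  decision_date = c("2018-02-01", "2018-09-15", "2019-03-10", "2019-11-20"),
  duration_days = c(400, 500, 450, 550),
  stringsAsFactors = FALSE
)

# Year of the decision as a number
duration_demo$decision_year <-
  as.numeric(format(as.Date(duration_demo$decision_date), "%Y"))

# Mean duration per year
yearly_mean <- aggregate(
  duration_days ~ decision_year,
  data = duration_demo,
  FUN = mean
)

yearly_mean
```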

Having produced a minimal working example, we might go on to make some improvements to the plot, such as defining the minimum and maximum values to start the Y axis at zero and labelling the axes in a more appealing way. Furthermore, we might add a third dimension to the plot, drawing several lines: Below, the figure is adapted to draw urgent procedures in a separate line with a different colour.

# Create character variable for urgent procedure. 
# Ifelse works as follows: 
# if "argument 1 is TRUE" then "argument 2", else "argument 3".
judgments$urgent <- 
  ifelse(judgments$is_urgent_procedure == 1,
         "Urgent procedure",
         "Not urgent procedure")

ggplot(judgments, 
       aes(
         x = as.Date(decision_date),
         y = duration_days,
         col = urgent
       )) +
  geom_smooth() +
  coord_cartesian(ylim = c(0, 800), # define limits on y axis
                  expand = FALSE) + # start plot in 0
  labs(x = "Date of decision",
       y = "Procedure time (days)",
       col = "") + # the lines need no further description
  theme(legend.position = "top") 

This produces the following graph, plotting urgent and non-urgent procedures separately and illustrating that the urgent procedure shortens procedure times in preliminary rulings by more than 400 days during most of the observed period.
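A smoothed line is easy on the eye, but it can be worth checking the size of the gap numerically as well. Below is a minimal sketch using aggregate from base R. The toy data frame and its numbers are made up for illustration, and the decision_year column is an assumption: in the real data it would first have to be derived from decision_date.

```r
# Hypothetical toy data standing in for the downloaded judgments
toy <- data.frame(
  decision_year = c(2015, 2015, 2015, 2016, 2016, 2016),
  urgent = c("Urgent procedure", "Not urgent procedure", "Not urgent procedure",
             "Urgent procedure", "Urgent procedure", "Not urgent procedure"),
  duration_days = c(60, 480, 520, 70, 90, 450)
)

# Mean duration per year and procedure type
means <- aggregate(duration_days ~ decision_year + urgent,
                   data = toy,
                   FUN = mean)
means
```

The resulting table has one row per combination of year and procedure type, which makes it easy to compare the two groups year by year.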

3.2. Number of judgments in a policy area over time

Next, let's say we want to create a simple bar plot showing the number of environmental judgments over time. First, we download a data set of judgments by the Court of Justice and the General Court:

# Download data set of judgments by both courts
judgments <- download_data(
  component = "cjeu_database_platform", 
  table = "judgments",
  filters = list(
    court = c("Court of Justice",
              "General Court")
  ),
  variables = c(
    "cjeu_case_id", 
    "ecli", 
    "decision_date",
    "court",
    "list_subject_keywords"
  )
)

Having downloaded the data, we can select the environment cases and add a variable observing the year of the decision. We also make an additional variable observing whether the judgment relates to pollution, agriculture and fisheries, or public health. A shortcut is made by treating these categories as mutually exclusive, and five observations are found to be affected by this. This is not enough to influence the interpretation of the graph in a significant way, but it is nevertheless worth taking notice of.

# Year of judgment:
judgments$decision_date <- 
  as.Date(judgments$decision_date)

judgments$decision_year <- 
  as.numeric(format(judgments$decision_date, "%Y"))

# Subject matter as list (str_split comes from the stringr package):
library(stringr)
judgments$subject_matter <- 
  str_split(judgments$list_subject_keywords, "; ")

# Data set of environment judgments:
x <- which(sapply(judgments$subject_matter, function(y) 
  "Environment" %in% y
))

environment <- judgments[x,]

# Type of environmental case:
environment$type_environment <- "Other"
environment$type_environment[
  grep("Agriculture and Fisheries", environment$subject_matter)] <-
  "Agriculture and Fisheries"
environment$type_environment[
  grep("Public health", environment$subject_matter)] <-
  "Public health"
environment$type_environment[
  grep("Pollution", environment$subject_matter)] <-
  "Pollution"

# Observe how many judgments fit in multiple categories:
sum(grepl("Agriculture and Fisheries", environment$subject_matter) & 
      grepl("Public health", environment$subject_matter) | 
      grepl("Agriculture and Fisheries", environment$subject_matter) & 
      grepl("Pollution", environment$subject_matter) | 
      grepl("Public health", environment$subject_matter) & 
      grepl("Pollution", environment$subject_matter))

# It is only five.
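The overlap check above can also be written more compactly. The sketch below is one possible alternative, shown on made-up keyword lists rather than the real data. Note that it matches keywords exactly with %in%, whereas grepl looks for the text anywhere in the keyword string, so the two approaches could in principle give different counts.

```r
# Hypothetical keyword lists, one list element per judgment
subject_matter <- list(
  c("Environment", "Pollution"),
  c("Environment", "Public health", "Pollution"),
  c("Environment", "Agriculture and Fisheries"),
  c("Environment")
)

categories <- c("Agriculture and Fisheries", "Public health", "Pollution")

# One row per judgment, one TRUE/FALSE column per category
flags <- t(sapply(subject_matter, function(y) categories %in% y))
colnames(flags) <- categories

# Number of judgments carrying more than one of the three keywords
n_overlap <- sum(rowSums(flags) > 1)
n_overlap   # 1 in this toy example
```

An advantage of this layout is that adding a fourth category only requires extending the categories vector.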

Having prepared the data, it's time to draw the plot. The geom_bar function is used to draw a nice bar plot in ggplot, with the fill colour used to indicate the different subcategories of subject matters we defined in the type_environment variable. As we want the bar plot to simply count the number of observations in each year, no Y axis needs to be defined.
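To make clear what geom_bar counts when no Y axis is given, here is a rough equivalent using table from base R. The toy data is made up for illustration. Each cell of the resulting table corresponds to the height of one coloured segment of a bar.

```r
# Hypothetical toy data: one row per judgment
toy <- data.frame(
  decision_year = c(2008, 2008, 2009, 2009, 2009),
  type_environment = c("Pollution", "Other", "Pollution",
                       "Pollution", "Public health")
)

# Number of judgments per year (rows) and type (columns)
counts <- table(toy$decision_year, toy$type_environment)
counts
```

A table like this is also a handy way of checking the plot against the underlying numbers.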

# Draw bar plot
ggplot(environment, 
       aes(
         x = decision_year,
         fill = type_environment
       )) +
  geom_bar() + 
  labs(x = "Year",
       y = "Number of CJEU judgments on environment",
       fill = "Type of issue:") +
  theme(legend.position = "top")

This produces the following graph:

The figure shows an increasing number of environmental cases until around 2010, after which the number stabilised (and dropped slightly during COVID). While the number of judgments relating to pollution has dropped since the early 2010s, environmental cases that are also seen as relating to public health have started appearing since the late 2000s. Agriculture and Fisheries, which was responsible for the first judgments of the CJEU categorised as relating to the environment, has become relatively sidelined, though a small number of these judgments are still made on a near yearly basis.

When using the subject matter variable it is important to keep in mind that this is based on the official metadata of the CJEU and the EU: It is likely that some cases might relate to the environment yet go under the radar of the official classification.

3.3. Age of appointed judges

The appointments data set can be used to analyse the age of judges when they are appointed to the CJEU. Furthermore, we can compare the age of appointed judges according to a number of different criteria, such as the tribunal they are appointed to or the gender of the judge.

First, we need to download data on the appointments of judges. I chose to limit the selection to judges (excluding registrars and advocates general) and to exclude the Civil Service Tribunal.

# Download judge appointments
appointments <- download_data(
  component = "cjeu_database_platform", 
  table = "appointments",
  filters = list(
    position = "judge",
    court = c("Court of Justice", 
              "General Court")
  ),
  variables = c(
    "iuropa_judge_id", 
    "birth_year", 
    "is_female", 
    "member_state",
    "court",
    "start_date")
)

The next step is to calculate the age of the judges at the time of appointment. This will be approximate, as only the year and not the date of birth is included in the data set. We can calculate the age by first converting start_date into a year by using a combination of as.Date, format, and as.numeric. The age of the judge can then be calculated using simple subtraction.

# Convert start date to date format
appointments$start_date <- as.Date(appointments$start_date)

# Year in which the appointment began
appointments$start_year <- 
  as.numeric(
    format(
      appointments$start_date,
      "%Y" # Changes date to year (in character format)
    )
  )

# Approximate age at beginning of appointment
appointments$age <- 
  appointments$start_year - appointments$birth_year
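To illustrate why the result is only approximate, the sketch below uses a made-up judge with a full birth date. This is purely hypothetical, as the data set only contains the birth year. Subtracting years overstates the age by one whenever the birthday falls later in the year than the appointment.

```r
# Hypothetical judge: the exact birth date is NOT available in the data set
birth_date <- as.Date("1950-12-01")
start_date <- as.Date("2000-10-07")

birth_year <- as.numeric(format(birth_date, "%Y"))
start_year <- as.numeric(format(start_date, "%Y"))

# What the year-based subtraction yields
approx_age <- start_year - birth_year

# Exact age: subtract one if the birthday has not yet occurred in the start year
exact_age <- approx_age -
  (format(start_date, "%m%d") < format(birth_date, "%m%d"))

approx_age  # 50
exact_age   # 49
```

The error is never more than one year, which should matter little when comparing groups of judges.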

Having identified the age of the judges when they were first appointed, we may take a moment to consider possible problems with our empirical approach.

First, some judges might have multiple recorded appointments to the same tribunal, if they left the tribunal for a period of time and later returned to it. We might do well to remove these observations of repeated appointments to the same tribunal.

Second, the age of appointed judges might change over time, and the General Court (established 1989) is a younger tribunal than the Court of Justice (established 1952). While one way of controlling for this would be to plot the average age of appointed judges over time, similar to what is done above for the length of procedures, I instead choose here to exclude judges appointed before the first appointments were made to the General Court on September 25, 1989.
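Before turning to the real table, the removal of repeated appointments can be sketched on toy data. The judge IDs and dates below are made up for illustration. The sketch passes the two columns to duplicated directly, which is an alternative to pasting them together and should give the same result. Sorting by date first matters, as duplicated keeps the first occurrence it encounters.

```r
# Hypothetical appointments: judge "J:1" returns to the Court of Justice in 2005
toy <- data.frame(
  iuropa_judge_id = c("J:1", "J:2", "J:1", "J:1"),
  court = c("Court of Justice", "General Court",
            "Court of Justice", "General Court"),
  start_date = as.Date(c("1995-01-01", "1996-01-01",
                         "2005-01-01", "2001-01-01"))
)

# Sort by date so that each judge's earliest appointment comes first
toy <- toy[order(toy$start_date), ]

# Flag rows whose judge/court pair has already appeared further up, and drop them
first_only <- toy[!duplicated(toy[, c("iuropa_judge_id", "court")]), ]
first_only
```

In the toy example, the 2005 return to the Court of Justice is dropped, while the same judge's appointment to the General Court is kept, as it concerns a different tribunal.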

# Sort in order of appointment
appointments <- appointments[order(appointments$start_date),]

# Remove observations of judges who returned to the same tribunal after a break
# This line keeps observations that are not duplicates of both judge ID and court.
appointments <- 
  appointments[which(!duplicated(
    paste0(
      appointments$iuropa_judge_id, 
      appointments$court
    )
  )
  ),]

# Identify the date of the first appointment made to the General Court
GC_start <- min(
  appointments$start_date[
    which(appointments$court == "General Court")
  ])

# Exclude observations prior to the first General Court appointment
# (>= keeps the first General Court appointees themselves)
appointments <- appointments[which(
  appointments$start_date >= GC_start
),]

# Add gender variable with text labels
appointments$gender <- ifelse(appointments$is_female == 1,
                              "Female",
                              "Male")

Having prepared the data, the only thing remaining is to draw the plots. A box plot is a useful way of representing the distribution of data, and will be used for this purpose here. I decide to plot the two courts on the X axis, the age of the judges on the Y axis, and to fill the boxes with colour according to gender.

ggplot(appointments, 
       aes(
         x = court,
         y = age,
         fill = gender
       )) +
  geom_boxplot() +
  labs(x = "Court",
       y = "Age of judge",
       fill = "Gender:") +
  theme(legend.position = "top") 

Box plots represent the distribution of observations. The boxes themselves represent the span of observations between the 25th and 75th percentiles, meaning that half the observations will be within the boxes. The thick line within each box shows the median value. The thin vertical lines, known as whiskers, extend to the most extreme observations that lie within 1.5 times the height of the box (the interquartile range) from its edges, which is the default in geom_boxplot. Any observations beyond the whiskers are drawn as individual points.
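To see where these numbers come from, here is a small sketch using boxplot.stats from base R on made-up ages (the values are hypothetical, not taken from the appointments data). By default, both this function and geom_boxplot let the whiskers reach only as far as 1.5 times the interquartile range from the box and treat more distant observations as outliers. The two compute the box edges slightly differently, so the results may not match the ggplot figure exactly.

```r
# Hypothetical ages at appointment, including one unusually old appointee
age <- c(45, 48, 50, 52, 53, 55, 57, 58, 60, 80)

stats <- boxplot.stats(age)

# Lower whisker, lower box edge, median, upper box edge, upper whisker
stats$stats   # 45 50 54 58 60

# Observations beyond the whiskers, drawn as separate points
stats$out     # 80
```

In this toy example the upper whisker stops at 60 rather than at the maximum of 80, which is shown as an individual point instead.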

The figure shows clear differences both between the two courts and between female and male judges in the CJEU. Unsurprisingly, the General Court tends to receive younger judges than the Court of Justice. The extreme observations are, however, more widespread, which might in part be a consequence of the General Court having seen more appointments in total in the period due to its expansion to two judges per member state.

Perhaps more interestingly, the women appointed to the CJEU tend to be younger than their male colleagues, and especially so in the Court of Justice: The median age of female appointments to the Court of Justice in the period is lower than the 25th percentile of male judges. That said, as there are only twelve women appointed to the Court of Justice in the period, one should be careful not to draw overly strong conclusions based on this evidence.


4. Closing remarks

While I of course only scratch the surface here, I hope this helps to open the door to studying the CJEU using empirical methods. My plan is to update this document as I hear from people who want help with specific challenges, or based on any feedback. If you have such feedback, or there is anything you would like an introduction to when it comes to using the data on the CJEU, don't hesitate to send me an email! The same goes if you need additional information that is not currently included in the IUROPA database, as I might be able to help you out.

For those interested in digging deeper into the world of R, I can only repeat my recommendation of R for Data Science.

If you do end up using the data from the IUROPA platform, please cite the CJEU Database Platform: Decisions and Decision-Makers. You can find the BibLaTex and RIS entries here.