Structuring code for ggplot

Published: April 28, 2020

R is not a language known for its beauty. As far as programming languages go, it has some unconventional standards for code (<- is preferred over = for variable assignment, for example). It doesn’t have many linting standards and feels like a bit of a black sheep compared to other data-oriented languages such as Python and C++. That’s not surprising, given its history and focus on statistical computing, but I find that this history comes through in the code in strange ways.

Take this example from the ggplot2 documentation [1].

ggplot(diamonds, aes(carat)) +
  geom_density()

That R code produces this figure.

[Figure: density plot of carat for the diamonds dataset]

Why write it like that? Why put the + at the end of the line and the geom_density() on a new line? I understand that the trailing + is needed for the code to execute properly (without it, R treats the first line as a complete expression and stops there), but why adopt that as a style of writing?
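One way to see why the trailing + is needed is to ask R’s parser how many expressions it finds in each layout. The sketch below uses plain numbers in place of plot layers (an assumption made here so that it runs in base R without ggplot2); the parsing behaviour is the same for any operands.

```r
# Numbers stand in for plot layers so the sketch runs in base R.

# Trailing +: line 1 is incomplete, so the parser keeps reading.
trailing <- parse(text = "1 +\n  2")

# Leading +: line 1 is already complete, so line 2 becomes a
# separate expression (a unary plus applied to 2).
leading <- parse(text = "1\n  + 2")

length(trailing)     # one expression
length(leading)      # two expressions
eval(trailing[[1]])  # the sum of both terms
eval(leading[[1]])   # only the first term
```

So the trailing + is not a stylistic accident: at the top level of a script it is what tells R that the expression continues onto the next line.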

This type of R code is common with ggplot. Take a look at this StackOverflow example (or almost any other ggplot-related question). The accepted answer is this:

ggplot(data = dta) +
    geom_col(aes(x = group, y = value, fill = sector)) +
    geom_text(aes(x = group, y = value, label = value, group = sector),
                  position = position_stack(vjust = .5))

Consider this block of object-oriented code:

new_obj = obj.object_method1()
    .object_method2()    # each method applies to the object
    .object_method3()    # returned from the previous line

It feels like R’s preferred way of writing it would look something more like this:

new_obj = obj.          # this . is related to object_method1
    object_method1().   # this . is related to object_method2
    object_method2().
    object_method3()

That just seems wild to me. Putting the . at the end of the line says “there’s something coming up, be sure to read the next line”, but it doesn’t give any clue as to what that next line is about. It breaks the conceptual link between the method and the object it’s being applied to.

Similarly, in the R code, putting the + at the end of the line says “there’s another object I want you to add”, but it breaks the link between the addition and the object that is being added. It makes R code look radically different from other programming languages, even if it’s very similar in concept.

More practically, it also means that you can’t easily copy and paste lines of code, because you have to manually remove the trailing + whenever you reorder or remove lines. In the first object-oriented example, every run of consecutive lines that starts with the first is valid in and of itself.

# valid
new_obj = obj.object_method1()
    # .object_method2()
    # .object_method3()

# valid
new_obj = obj.object_method1()
    .object_method2()
    # .object_method3()

# valid
new_obj = obj.object_method1()
    .object_method2()
    .object_method3()

# may be valid if object_method3 can be called on the output of object_method1
new_obj = obj.object_method1()
    # .object_method2()
    .object_method3()

Written in the R-inspired way, no partial block is valid, only the entire block.

# invalid
new_obj = obj.

# invalid
new_obj = obj.
    object_method1().


# invalid
new_obj = obj.
    object_method1().
    object_method2().

# valid
new_obj = obj.
    object_method1().
    object_method2().
    object_method3()

It shouldn’t be difficult to see what trouble this can cause.
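To make the trouble concrete in R itself, here is a minimal sketch that checks whether a snippet is syntactically complete. The helper name is_complete is hypothetical, introduced only for this illustration. It relies on parse(), which checks syntax without evaluating anything, so ggplot2 does not need to be loaded.

```r
# Hypothetical helper: TRUE if `code` parses as complete R code.
# parse() only checks syntax; nothing is evaluated or drawn.
is_complete <- function(code) {
  tryCatch({
    parse(text = code)
    TRUE
  }, error = function(e) FALSE)
}

# The documentation example, written with a trailing +.
full_block <- "ggplot(diamonds, aes(carat)) +
  geom_density()"

# The same block with its last layer commented out.
last_layer_commented <- "ggplot(diamonds, aes(carat)) +
  # geom_density()"

is_complete(full_block)            # TRUE
is_complete(last_layer_commented)  # FALSE: the trailing + has nothing to add
```

Commenting out the final layer leaves a dangling +, so the whole block stops parsing until the previous line is edited as well.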

If an entire block of code should not be separated, one should use brackets of some kind. So, to combine the conceptual and stylistic notation for working with objects in R code, I propose this radical solution to the first example:

  1. Wrap the entire resulting object in ()
  2. Put each + at the start of the next line, with the object being added

For example, here is the ggplot documentation code:

(ggplot(diamonds, aes(carat))
    + geom_density()
)

Not a big deal, and since it’s so simple, it would probably make sense to just put them on the same line. But let’s revisit the StackOverflow example code with this new code style:

(ggplot(data = dta)
    + geom_col(aes(x = group, y = value, fill = sector))
    + geom_text(aes(x = group, y = value, label = value, group = sector),
                    position = position_stack(vjust = .5))
)

Tada! The code itself is not vastly different, but look at how much easier that is to read. The + is in the same place on each line. Each new line at the first indent is a new layer (or other plot component) to be added. Subsets of lines are still valid [2].

# valid
(ggplot(data = dta)
    # + geom_col(aes(x = group, y = value, fill = sector))
    # + geom_text(aes(x = group, y = value, label = value, group = sector),
    #                 position = position_stack(vjust = .5))
)

# valid
(ggplot(data = dta)
    + geom_col(aes(x = group, y = value, fill = sector))
    # + geom_text(aes(x = group, y = value, label = value, group = sector),
    #                 position = position_stack(vjust = .5))
)

# valid
(ggplot(data = dta)
    # + geom_col(aes(x = group, y = value, fill = sector))
    + geom_text(aes(x = group, y = value, label = value, group = sector),
                    position = position_stack(vjust = .5))
)

# valid
(ggplot(data = dta)
    + geom_col(aes(x = group, y = value, fill = sector))
    + geom_text(aes(x = group, y = value, label = value, group = sector),
                    position = position_stack(vjust = .5))
)

# assignment to a variable still fits the style and is easy to read
gg <- (
    ggplot(data = dta)
    + geom_col(aes(x = group, y = value, fill = sector))
    + geom_text(aes(x = group, y = value, label = value, group = sector),
                    position = position_stack(vjust = .5))
)
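The “valid” labels above can be checked without drawing anything. The sketch below counts how many top-level expressions R’s parser finds in a snippet. The helper count_expressions is hypothetical, introduced only for this illustration. Because parse() checks syntax only, neither ggplot2 nor the dta data frame is needed.

```r
# Hypothetical helper: number of top-level expressions in `code`,
# or NA if the code does not parse at all.
count_expressions <- function(code) {
  tryCatch(length(parse(text = code)), error = function(e) NA_integer_)
}

both_layers <- "(ggplot(data = dta)
    + geom_col(aes(x = group, y = value, fill = sector))
    + geom_text(aes(x = group, y = value, label = value, group = sector),
                    position = position_stack(vjust = .5))
)"

first_layer_off <- "(ggplot(data = dta)
    # + geom_col(aes(x = group, y = value, fill = sector))
    + geom_text(aes(x = group, y = value, label = value, group = sector),
                    position = position_stack(vjust = .5))
)"

# Same leading-+ layout, but without the surrounding parentheses.
no_parentheses <- "ggplot(data = dta)
    + geom_col(aes(x = group, y = value, fill = sector))"

count_expressions(both_layers)      # 1
count_expressions(first_layer_off)  # 1
count_expressions(no_parentheses)   # 2: the + line is parsed on its own
```

The last case shows why the parentheses are not optional in this style: without them the first line is a complete expression by itself, and the + line becomes a second, separate expression rather than a layer added to the plot.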

And for longer, more complicated plots, it’s almost too easy to see what’s happening. Here is some code related to a publication I’m a co-author on [3]:

gg = (
    ggplot(
        data = pca_exprs,
        mapping = aes(x = Name, y = log2TPM)
    )
    + geom_boxplot(fill = "#26A69A")
    + geom_point(
        data = pca_foxa1_exprs,
        mapping = aes(x = Name, y = log2TPM),
        fill = "#EF5350",
        pch = 21,
        size = 4
    )
    + labs(x = NULL, y = "Gene Expression log2(TPM + 1)")
    + guides(fill = FALSE)
    + theme_classic()
    + theme(
        # font sizes for axes and legend
        axis.text.x = element_text(size = 12, angle = 90, hjust = 1),
        axis.text.y = element_text(size = 12),
        axis.title = element_text(size = 16),
        legend.text = element_text(size = 12),
        legend.title = element_text(size = 16),
        # plot background colouring
        axis.ticks = element_blank(),
        panel.grid.major.x = element_blank(),
        panel.grid.major.y = element_line(colour = "#9e9e9e"),
        panel.background = element_rect(fill = "transparent")
    )
)

[Figure: resulting plot]

It’s visually appealing. It’s not more cluttered than it needs to be. It’s easy to see the separation of the layers this way. The design philosophy behind ggplot is reflected in the code. That is why I recommend structuring R code for ggplot this way.

Footnotes

  1. ggplot2 is probably the most popular plotting package in R.

  2. This is very important when designing figures, and simplifies the process drastically.

  3. That code is copied directly from here. It is from this repo, which is from this paper.