Object-Oriented Programming (OOP) with Datasets

Showing the idea of using OOP (specifically R’s S3 system) to define and manipulate datasets in a standard way

Published

October 29, 2023

Datasets come in many shapes and sizes, with different numbers of rows and columns. But what if I need to interface with datasets in a specific but standard way? I can create a class for this.

Let’s say I have two datasets that I want to represent in a standard way.

d1 <- datasets::airquality
d2 <- datasets::anscombe
d1 |> head(5)
  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
4    18     313 11.5   62     5   4
5    NA      NA 14.3   56     5   5
d2 |> head(5)
  x1 x2 x3 x4   y1   y2    y3   y4
1 10 10 10  8 8.04 9.14  7.46 6.58
2  8  8  8  8 6.95 8.14  6.77 5.76
3 13 13 13  8 7.58 8.74 12.74 7.71
4  9  9  9  8 8.81 8.77  7.11 8.84
5 11 11 11  8 8.33 9.26  7.81 8.47

These two example datasets are already S3 objects: both have the class data.frame.

sloop::otype(d1)
[1] "S3"
sloop::s3_class(d1)
[1] "data.frame"
sloop::otype(d2)
[1] "S3"
sloop::s3_class(d2)
[1] "data.frame"
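
The same facts can be checked with base R alone, which may be handy if sloop is not installed. A small sketch using only base functions:

```r
d1 <- datasets::airquality

class(d1)                   # the S3 class attribute
inherits(d1, "data.frame")  # TRUE when "data.frame" is in the class vector
is.object(d1)               # TRUE because a class attribute is set
isS4(d1)                    # FALSE, so this is not an S4 object
typeof(d1)                  # the underlying storage type, a list
```

So a data frame is a list with a class attribute on top, which is the same pattern used for the custom class in this post.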

So what’s the issue if they already have a class? I want the number of rows and columns, along with a name for the data, available in one place. That is easy to derive here, but think ‘what if it weren’t?’: having the information stored on the object would be handy, and the class would then be a simple interface to it. To that end, I would make an object with class ‘my_data’ holding the elements below. This is a sketch of the shape only (the constructor further down stores the computed counts rather than functions). It generates a fancy list using structure:

structure(
    list(
        data.frame = NULL,
        name = NULL,
        ncols = function(x)ncol(x),
        nrows = function(x)nrow(x)
    ),
    class = "my_data"
)
$data.frame
NULL

$name
NULL

$ncols
function(x)ncol(x)

$nrows
function(x)nrow(x)

attr(,"class")
[1] "my_data"
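
The "fancy list" description can be checked directly: the class is only an attribute sitting on top of an ordinary list. A minimal sketch, where `proto` is just an illustrative name and the function elements are left out for brevity:

```r
proto <- structure(
    list(data.frame = NULL, name = NULL),
    class = "my_data"
)

inherits(proto, "my_data")  # TRUE
is.list(proto)              # TRUE, still a list underneath
attr(proto, "class")        # "my_data"
names(unclass(proto))       # element names once the class is stripped
```

Because nothing more than this is involved, any code that works on lists (such as `$` and `[[`) keeps working on objects of the new class.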

To create objects of this class, I would create a constructor. The constructor would just be a function taking in the input and wrapping it in my new class (a fancy list):

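# Constructor: wrap a data frame and its metadata in a 'my_data' object
my_data <- function(dat, name = NULL) {
    # Fail early if the input is not a data frame
    stopifnot(is.data.frame(dat))
    structure(
        list(
            data.frame = dat,
            name = name,
            ncols = ncol(dat),  # computed once, at construction
            nrows = nrow(dat)
        ),
        class = "my_data"
    )
}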

d1_data <- my_data(d1,"d1")
d2_data <- my_data(d2,"d2")

d1_data$name
[1] "d1"
d2_data$ncols
[1] 8
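
The `stopifnot()` check is what keeps the class honest: anything that is not a data frame is rejected before an object is built. A minimal sketch of that behaviour, repeating the constructor so the block runs on its own (the `res` variable and the `"not_a_df"` label are only for illustration):

```r
my_data <- function(dat, name = NULL) {
    stopifnot(is.data.frame(dat))
    structure(
        list(
            data.frame = dat,
            name = name,
            ncols = ncol(dat),
            nrows = nrow(dat)
        ),
        class = "my_data"
    )
}

# Passing an integer vector should fail; capture the error message
# instead of stopping the script
res <- tryCatch(
    my_data(1:3, "not_a_df"),
    error = function(e) conditionMessage(e)
)
res
```

One design point to keep in mind: `ncols` and `nrows` are computed when the object is created, so they will not update if the stored data frame is modified afterwards.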

This is somewhat useful, but it would be more useful to show the information I want in the way I want. To do this, I define a generic function and a method for this class:

# Generic: dispatches on the class of x
show <- function(x){
    UseMethod("show")
}

# Method for 'my_data' objects
show.my_data <- function(x){
    cat(x$name,"\n")
    cat("Rows: ",x$nrows,"\n")
    cat("Columns: ",x$ncols,"\n")
}
show(d1_data)
d1 
Rows:  153 
Columns:  6
show(d2_data)
d2 
Rows:  11 
Columns:  8

I defined a show method that is specific to objects of my class. Pretty nifty. This gets niftier as the elements stored in the class, and the operations on its objects, become more complex.
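
One caveat with the name: `show` is also an S4 generic in the methods package, so defining a function called `show` in the global environment masks it. An alternative that avoids this is to write a method for the existing `print` generic, which R calls automatically when an S3 object is printed at the console. A minimal sketch, assuming the same constructor as above (repeated here so the block runs on its own):

```r
my_data <- function(dat, name = NULL) {
    stopifnot(is.data.frame(dat))
    structure(
        list(
            data.frame = dat,
            name = name,
            ncols = ncol(dat),
            nrows = nrow(dat)
        ),
        class = "my_data"
    )
}

# print() is already a generic, so only the method is needed
print.my_data <- function(x, ...) {
    cat(x$name, "\n")
    cat("Rows: ", x$nrows, "\n")
    cat("Columns: ", x$ncols, "\n")
    invisible(x)  # print methods conventionally return their input invisibly
}

d1_data <- my_data(datasets::airquality, "d1")
print(d1_data)

# Capture the printed lines, e.g. for checking them in a test
out <- capture.output(print(d1_data))
```

With this in place, typing `d1_data` at the console gives the summary rather than dumping the whole list.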