Tutorial: Polymorphism in C

Christopher Bazley, October-December 2022

A tower of toy wooden bricks of different colours and shapes

Introduction

I cited some examples of subtyping polymorphism in my opinion piece, 'C versus C++: fight!' This tutorial develops those examples into a full program, showing why such techniques are useful and how to implement them.

The word polymorphism is from Ancient Greek, simply meaning "many shapes". Dictionary definitions of the word typically say something like "existing in several different forms".

Wikipedia provides a more practical definition for programmers:

polymorphism is the provision of a single interface to entities of different types

Hopefully this tutorial will be of use to people learning C programming, or those unsure of how such a theoretical concept applies to a low-level language such as C.

Problem statement

Let's imagine that we want to write a command line tool to add the values of every byte in a file of unknown length. The result could be used as a checksum to detect errors in the file, although the proposed algorithm doesn't detect reordering of bytes.

Here's a simple C program which implements this idea:

#include <stdio.h>

unsigned char sum_bytes(FILE *f)
{
    unsigned char sum = 0;
    for (int c = fgetc(f); c != EOF; c = fgetc(f)) {
        sum += c;
    }
    return sum;
}

int main(int argc, char const *argv[])
{
    return sum_bytes(stdin);
}

Now, we need to test our program to ensure it doesn't contain coding errors. To do so, we could create a file containing a known sequence of bytes, pipe that input into our program, then check that the return value of the program is what we expect.

For example, such a test could be implemented as the following Bash script:

echo -en '\x1\x2\x3\x4' | checksum
if [ $? -eq 10 ]; then
    echo "Test passed"
else
    echo "Test failed"
fi

This solution isn't ideal because the test script has limited functionality and isn't portable to other platforms. We might instead want to test the sum_bytes() function in isolation, to help locate the cause of any bugs.

We could write a test program that calls sum_bytes() on a file instead of the standard input stream, but it would need to create that file first:

#include <stdlib.h>
#include <stdio.h>

int main(int argc, char const *argv[])
{
    FILE *f = fopen("testdata", "w+b");
    if (!f) {
        fputs("Can't create file\n", stderr);
        return EXIT_FAILURE;
    }
    static unsigned char const testdata[] = {1, 2, 3, 4};
    int err = EXIT_SUCCESS;
    if (fwrite(testdata, sizeof(testdata), 1, f) != 1) {
        fputs("Can't write to file\n", stderr);
        err = EXIT_FAILURE;
    } else {
        rewind(f);
        if (sum_bytes(f) == 10) {
            puts("Test passed");
        } else {
            fputs("Test failed\n", stderr);
            err = EXIT_FAILURE;
        }
    }
    fclose(f);
    return err;
}

The above solution requires the test to have write access to a file system on which to store temporary files. We could solve that by linking the test with a mock definition of fgetc() that reads a test vector, or by using a macro to redefine fgetc() as an alias for a mock function of a different name.

Both are valid approaches to unit testing but fall into the categories of link-time polymorphism and generic programming. This tutorial is instead concerned with run-time polymorphism, which allows a single instance of sum_bytes() to read from different types of data source within a statically linked program.

The fundamental problem with the sum_bytes() function is that it only operates on the type FILE * from <stdio.h>. We might instead want to calculate a checksum for data stored in read-only memory, data read from a network, or for data output by a decompression library such as ZLib. The C standard library's streams are a useful abstraction, but too limited to support all use-cases.

Naive solution

A naive solution to allowing the sum_bytes() function to read from different data sources would be to use conditional logic. For example, we could pass both the address of a FILE and the address of an array, and use a Boolean argument to determine which data source to use.

Here's a suitably modified version of the test program:

#include <stdbool.h>
#include <stdlib.h>
#include <stdio.h>

unsigned char sum_bytes(bool read_file, FILE *f,
                        int const *buffer)
{
    unsigned char sum = 0;
    size_t i = 0;
    for (int c = read_file ? fgetc(f) : buffer[i++];
         c != EOF;
         c = read_file ? fgetc(f) : buffer[i++])
    {
        sum += c;
    }
    return sum;
}

int main(int argc, char const *argv[])
{
    static int const testdata[] = {1, 2, 3, 4, EOF};
    if (sum_bytes(false, NULL, testdata) == 10) {
      puts("Test passed");
      return EXIT_SUCCESS;
    } else {
      fputs("Test failed\n", stderr);
      return EXIT_FAILURE;
    }
}

The sum_bytes() function is becoming unwieldy and error-prone to use because the caller must pass three arguments instead of one. It's easy to wrongly pass true instead of false, or vice-versa. This solution also scales poorly because the complexity of sum_bytes() increases linearly with the number of input data types supported.

Polymorphism to the rescue

A more scalable solution is for the caller of sum_bytes() to pass the address of a function to be called in place of fgetc(), and a void * pointer to be passed when calling that function:

unsigned char sum_bytes(int (*getc_fn)(void *), void *getc_arg)
{
    unsigned char sum = 0;
    for (int c = getc_fn(getc_arg);
         c != EOF;
         c = getc_fn(getc_arg))
    {
        sum += c;
    }
    return sum;
}

I've just described a single interface to entities of different types, therefore this solution qualifies as polymorphism.

The interface is still somewhat unwieldy because the caller of sum_bytes() must pass two arguments instead of one. It is also less type-safe than the previous interface: the caller must take care to pass a value of getc_arg that is compatible with the specified getc_fn. The compiler cannot help enforce that because any (unqualified) pointer type implicitly converts to void *.

Lastly, the required type of getc_fn isn't compatible with the standard fgetc() function, which expects an argument of type FILE * not void *, therefore the caller of sum_bytes() must use an adaptor function:

static int stream_reader_getc(void *cb_arg)
{
    return fgetc(cb_arg); // implicit conversion to FILE *
}

int main(int argc, char const *argv[])
{
    return sum_bytes(stream_reader_getc, stdin);
}

What we really want is for sum_bytes() to present a simple and type-safe interface which eliminates the possibility of error.

Let's start by wrapping the callback function address and associated data pointer into a struct to avoid them getting separated or mixed up with other variables:

struct reader {
  int (*getc_fn)(void *);
  void *getc_arg;
};

The definition above creates a unique type, struct reader, which cannot be used interchangeably with any other type (not even another struct that happens to have the same members). In object-oriented programming, struct reader would be termed an 'abstract base class' and its getc_fn pointer a 'virtual method'.

Whereas the previous version of sum_bytes() would accept the address of any callback function with a compatible signature (argument and return types), the new version only accepts the address of a struct reader object:

unsigned char sum_bytes(struct reader *r)
{
    unsigned char sum = 0;
    for (int c = r->getc_fn(r->getc_arg);
         c != EOF;
         c = r->getc_fn(r->getc_arg))
    {
        sum += c;
    }
    return sum;
}

int main(int argc, char const *argv[])
{
    struct reader r = {stream_reader_getc, stdin};
    return sum_bytes(&r);
}

To avoid mistakes when instantiating objects of type struct reader, let's create functions to initialize an object for different types of data source. These would be termed 'subclass constructors' in object-oriented programming:

void stream_reader_init(struct reader *r, FILE *stream)
{
    *r = (struct reader){stream_reader_getc, stream};
}

void buf_reader_init(struct reader *r, int *buffer)
{
    *r = (struct reader){buf_reader_getc, buffer};
}

This is already looking much safer, but I haven't yet provided a definition of the buf_reader_getc() function to read a test vector. Writing that function turns out to be problematic:

static int buf_reader_getc(void *cb_arg)
{
    static size_t bytes_read;
    int const *const buffer = cb_arg; // implicit conv. to int *
    return buffer[bytes_read++];
}

Whilst this might be acceptable for a throwaway test, it is not acceptable for a real program! The static storage class of the bytes_read counter means that only one buffer can be read using this function, and that buffer can be read only once.

What we actually need would be termed an 'instance variable' in object-oriented programming: every instance of struct reader initialized by buf_reader_init() must have its own copy of bytes_read which starts counting from 0. This can only be implemented by allocating extra memory for instances of struct reader that read from a buffer instead of from a file.

Now might be a good time to consider subtyping.

Subtyping polymorphism

Subtyping is not supported by C in the sense that subtypes can be used interchangeably with their supertype, because the type system is not hierarchical.

However, something like subtyping can be implemented by nesting a struct representing a supertype within a struct representing its subtype. Here's that approach applied to struct reader, which has been extended with a new subtype named struct buf_reader:

struct buf_reader {
  struct reader super;
  int const *buffer;
  size_t bytes_read;
};

static int buf_reader_getc(void *cb_arg)
{
    struct buf_reader *br = cb_arg;
    return br->buffer[br->bytes_read++];
}

void buf_reader_init(struct buf_reader *br, int const *buffer)
{
    *br = (struct buf_reader){
      .super = {buf_reader_getc, br},
      .buffer = buffer,
      .bytes_read = 0
    };
}

The test program now declares an object of the new subtype which contains extra space for instance variables, buffer and bytes_read. As before, it calls buf_reader_init() to initialize the object:

int main(int argc, char const *argv[])
{
    struct buf_reader br;
    static int const testdata[] = {1, 2, 3, 4, EOF};
    buf_reader_init(&br, testdata);
    if (sum_bytes(&br.super) == 10) {
      puts("Test passed");
      return EXIT_SUCCESS;
    } else {
      fputs("Test failed\n", stderr);
      return EXIT_FAILURE;
    }
}

The test vector can now be qualified as const (and perhaps stored in read-only memory) because the address of this array is no longer converted to an unqualified void * pointer in the buf_reader_init() constructor.

Note that the test must explicitly pass the address of the embedded struct reader when calling sum_bytes(), because struct reader and struct buf_reader are distinct from the point of view of the compiler. This would be termed 'upcasting' in object-oriented programming, where it is usually implicit.

Improvements

There are still a few questionable things about this design:

Requiring the test to provide an array of type int terminated by the macro value EOF, instead of an array of unsigned bytes, is awkward.
Storing the address of struct buf_reader in struct reader seems redundant; both objects have the same address. In any case, the address of either object could be calculated from the other using offsetof().

The first point can be addressed by adding a new instance variable, buf_size, to store the maximum number of bytes to be returned by buf_reader_getc():

struct buf_reader {
  struct reader super;
  unsigned char const *buffer;
  size_t bytes_read, buf_size;
};

static int buf_reader_getc(void *cb_arg)
{
    struct buf_reader *br = cb_arg;
    if (br->bytes_read >= br->buf_size) {
      return EOF;
    }
    return br->buffer[br->bytes_read++];
}

void buf_reader_init(struct buf_reader *br,
                     unsigned char const *buffer,
                     size_t buf_size)
{
    *br = (struct buf_reader){
      .super = {buf_reader_getc, br},
      .buffer = buffer,
      .bytes_read = 0,
      .buf_size = buf_size,
    };
}

int main(int argc, char const *argv[])
{
    struct buf_reader br;
    static unsigned char const testdata[] = {1, 2, 3, 4};
    buf_reader_init(&br, testdata, sizeof(testdata));
    if (sum_bytes(&br.super) == 10) {
      puts("Test passed");
      return EXIT_SUCCESS;
    } else {
      fputs("Test failed\n", stderr);
      return EXIT_FAILURE;
    }
}

Addressing the second point requires an interface change: instead of passing a void * pointer when calling the virtual method to get a byte, sum_bytes() could pass the address of the struct reader instance. This pointer to the current object would be termed 'self' or 'this' in object-oriented programming.

Here are are the modified struct reader and sum_bytes() definitions:

struct reader {
  int (*getc_fn)(struct reader *);
};

unsigned char sum_bytes(struct reader *r)
{
    unsigned char sum = 0;
    for (int c = r->getc_fn(r);
         c != EOF;
         c = r->getc_fn(r))
    {
        sum += c;
    }
    return sum;
}

Deleting the cb_arg member of struct reader means that the FILE * from which fgetc() reads data must instead be stored in an instance variable. That entails defining a second subtype, which I have named stream_reader:

#include <stddef.h>
#define container_of(addr, type, member) \
  ((type *)(((char *)(addr)) - offsetof(type, member)))

struct stream_reader {
  struct reader super;
  FILE *stream;
};

static int stream_reader_getc(struct reader *r)
{
    struct stream_reader *sr =
      container_of(r, struct stream_reader, super);
    return fgetc(sr->stream);
}

void stream_reader_init(struct stream_reader *sr, FILE *stream)
{
    *sr = (struct stream_reader){
      .super = {stream_reader_getc},
      .stream = stream
    };
}

Updating the functions associated with the first subtype (buf_reader_init() and buf_reader_getc()) to conform to this new interface is left as an exercise for the reader.

Inheritance or composition?

I used a de facto standard macro, container_of(), to calculate the address of a struct stream_reader from the address of its struct reader member. This would be termed 'downcasting' in object-oriented programming. It is inherently unsafe because the compiler cannot be sure that the given object is actually an instance of the specified subtype.

I wrote earlier that "both objects have the same address", so you might wonder why I used container_of() instead of simply casting to the required subtype struct stream_reader *.

In part, this choice comes down to whether or not you subscribe to the notion of inheritance. If you believe that stream_reader is a reader (inheritance), then a simple cast makes perfect sense; if you instead believe that stream_reader has a reader (composition), then container_of() makes more sense.

Using container_of() does have some practical advantages:

The compiler checks that struct stream_reader has a member named super; cleverer definitions of container_of() also force the compiler to check that the type of super is struct reader.
If the super member is accidentally moved from its position at the start of the definition of struct stream_reader then the program is still likely to work.
You might later decide that you want objects of type struct stream_reader to implement more than one interface ('multiple inheritance').

However, container_of() also has a big disadvantage, if you care about writing strictly-conformant code:

When an expression that has integer type is added to or subtracted from a pointer... If both the pointer operand and the result point to elements of the same array object, or one past the last element of the array object, the evaluation shall not produce an overflow; otherwise, the behavior is undefined.

(6.5.6 'Additive operators', ISO/IEC 9899:1999 'Programming languages — C')

The char * cast in the macro definition allows the designated member to be reinterpreted as an array, but it is not an array object, and the result of subtracting offsetof(type, member) may point outside it. In practice, the given definition works for all known compilers and target architectures.

Most programmers simply ignore the fact that container_of() may have undefined behaviour. It's up to you whether you want to follow them. You can instead continue to store a void * pointer in each instance of a supertype, or (if you don‘t need multiple inheritance) ensure that super is always the first member of a subtype struct. In the latter case, container_of() has defined behaviour equivalent to a simple cast.

Whither encapsulation?

In most circumstances, I'd be happy with the solution that I have presented above. It's simple, efficient, and type-safe (except for my use of container_of() in the virtual method definitions).

When working on a larger project, or when designing interfaces for which it's necessary to maintain binary compatibility, you might want to hide some implementation details. In C programs, the way to do so is using incomplete struct types. These can be used in any situation where the size of a struct is not needed and its members are not accessed directly.

Here are a few encapsulation failures to consider remedying:

The sum_bytes() function requires struct reader to be a complete type in order to call the virtual method that it uses to get data. This exposes any instance variables belonging to the supertype.
The checksum program requires struct stream_reader to be a complete type in order to declare an object of that type, and to allow it to obtain the address of the embedded struct reader. This exposes any instance variables belonging to both supertype and subtype.
The test requires struct buf_reader to be a complete type in order to declare an object of that type, and to allow it to obtain the address of the embedded struct reader.

The first point can be addressed by creating a header file which declares an incomplete struct reader type and wrapper functions to call any virtual methods that it provides (in this case, only one).

The complete definition of struct reader needed to implement subtypes must be put in a separate private header file. Provided that the private header is only included in source files which need the complete type, the struct members will be hidden.

reader.h — public interface

struct reader;
int reader_getc(struct reader *);

reader_struct.h — private interface

struct reader {
  int (*getc_fn)(struct reader *);
};

reader.c — private implementation

#include "reader.h"
#include "reader_struct.h"

int reader_getc(struct reader *r)
{
  return r->getc_fn(r);
}

The second point can be addressed by writing an alternative constructor for struct stream_reader that dynamically allocates memory for new object instances, and a function to obtain the address of the struct reader embedded in an instance of struct stream_reader.

Once again, the complete definition of struct stream_reader needed to implement the subtype, or allocate instances of it, must be put in a private header file.

stream_reader.h — public interface

#include <stdio.h>
#include "reader.h"

struct stream_reader;
void stream_reader_init(struct stream_reader *sr, FILE *stream);
struct stream_reader *stream_reader_new(FILE *stream);
struct reader *stream_reader_get_reader(struct stream_reader *);

stream_reader_struct.h — private interface

#include <stdio.h>
#include "reader_struct.h"

struct stream_reader {
  struct reader super;
  FILE *stream;
};

stream_reader.c — private implementation

#include "stream_reader.h"
#include "stream_reader_struct.h"

static int stream_reader_getc(struct reader *r)
{
    struct stream_reader *sr = container_of(r, struct stream_reader, super);
    return fgetc(sr->stream);
}

void stream_reader_init(struct stream_reader *sr, FILE *stream)
{
    *sr = (struct stream_reader){
      .super = {stream_reader_getc},
      .stream = stream
    };
}

struct stream_reader *stream_reader_new(FILE *stream)
{
  struct stream_reader *sr = malloc(sizeof(*sr));
  if (sr) {
    stream_reader_init(sr, stream);
  }
  return sr;
}

struct reader *stream_reader_get_reader(struct stream_reader *sr)
{
  return &sr->super;
}

Defining separate public and private interfaces for the other subtype is left as an exercise for the reader.

Food for thought

If instances of subtypes are allocated on the heap and provide no extra (i.e. non-virtual) methods then do those types need to be part of the public interface?
If subtypes are not part of the public interface then how would a program destroy an instance of a subtype?
Can you rank the proposed header files in order of how important it is to for each one to provide a stable interface?
If a subtype's constructor doesn't initialise a function pointer then calling the corresponding method has undefined behaviour. What would be useful default behaviour for 'getc', and how would you implement it?

It's getting late

I hope you found this tutorial useful; I had fun writing it. If not, why not comment and let me know what other C programming topics you would like to read about?

Thanks for reading.

Top of page | Home