I cited some examples of subtyping polymorphism in my opinion piece, 'C versus C++: fight!' This tutorial develops those examples into a full program, showing why such techniques are useful and how to implement them.
The word polymorphism is from Ancient Greek, simply meaning "many shapes". Dictionary definitions of the word typically say something like "existing in several different forms".
Wikipedia provides a more practical definition for programmers:
polymorphism is the provision of a single interface to entities of different types
Hopefully this tutorial will be of use to people learning C programming, or those unsure of how such a theoretical concept applies to a low-level language such as C.
Let's imagine that we want to write a command line tool to add the values of every byte in a file of unknown length. The result could be used as a checksum to detect errors in the file, although the proposed algorithm doesn't detect reordering of bytes.
Here's a simple C program which implements this idea:
#include <stdio.h> unsigned char sum_bytes(FILE *f) { unsigned char sum = 0; for (int c = fgetc(f); c != EOF; c = fgetc(f)) { sum += c; } return sum; } int main(int argc, char const *argv[]) { return sum_bytes(stdin); }
Now, we need to test our program to ensure it doesn't contain coding errors. To do so, we could create a file containing a known sequence of bytes, pipe that input into our program, then check that the return value of the program is what we expect.
For example, such a test could be implemented as the following Bash script:
echo -en '\x1\x2\x3\x4' | checksum if [ $? -eq 10 ]; then echo "Test passed" else echo "Test failed" fi
This solution isn't ideal because the test script has limited
functionality and isn't portable to other platforms. We might
instead want to test the sum_bytes()
function in
isolation, to help locate the cause of any bugs.
We could write a test program that calls
sum_bytes()
on a file instead of the standard input
stream, but it would need to create that file first:
#include <stdlib.h> #include <stdio.h> int main(int argc, char const *argv[]) { FILE *f = fopen("testdata", "w+b"); if (!f) { fputs("Can't create file\n", stderr); return EXIT_FAILURE; } static unsigned char const testdata[] = {1, 2, 3, 4}; int err = EXIT_SUCCESS; if (fwrite(testdata, sizeof(testdata), 1, f) != 1) { fputs("Can't write to file\n", stderr); err = EXIT_FAILURE; } else { rewind(f); if (sum_bytes(f) == 10) { puts("Test passed"); } else { fputs("Test failed\n", stderr); err = EXIT_FAILURE; } } fclose(f); return err; }
The above solution requires the test to have write access to
a file system on which to store temporary files. We could
solve that by linking the test with a mock definition of
fgetc()
that reads a test vector, or by using a
macro to redefine fgetc()
as an alias for a mock
function of a different name.
Both are valid approaches to unit testing but fall into the
categories of link-time polymorphism and generic programming.
This tutorial is instead concerned with run-time
polymorphism, which allows a single instance of
sum_bytes()
to read from different types of data
source within a statically linked program.
The fundamental problem with the sum_bytes()
function is that it only operates on the type FILE
*
from <stdio.h>
. We might instead
want to calculate a checksum for data stored in read-only
memory, data read from a network, or for data output by a
decompression library such as ZLib. The C standard library's
streams are a useful abstraction, but too limited to support
all use-cases.
A naive solution to allowing the sum_bytes()
function to read from different data sources would be to use
conditional logic. For example, we could pass both the
address of a FILE
and the address of an array,
and use a Boolean argument to determine which data source to
use.
Here's a suitably modified version of the test program:
#include <stdbool.h> #include <stdlib.h> #include <stdio.h> unsigned char sum_bytes(bool read_file, FILE *f, int const *buffer) { unsigned char sum = 0; size_t i = 0; for (int c = read_file ? fgetc(f) : buffer[i++]; c != EOF; c = read_file ? fgetc(f) : buffer[i++]) { sum += c; } return sum; } int main(int argc, char const *argv[]) { static int const testdata[] = {1, 2, 3, 4, EOF}; if (sum_bytes(false, NULL, testdata) == 10) { puts("Test passed"); return EXIT_SUCCESS; } else { fputs("Test failed\n", stderr); return EXIT_FAILURE; } }
The sum_bytes()
function is becoming unwieldy
and error-prone to use because the caller must pass three
arguments instead of one. It's easy to wrongly pass true
instead of false, or vice-versa. This solution also scales
poorly because the complexity of sum_bytes()
increases linearly with the number of input data types
supported.
A more scalable solution is for the caller of
sum_bytes()
to pass the address of a function to be
called in place of fgetc()
, and a void
*
pointer to be passed when calling that function:
unsigned char sum_bytes(int (*getc_fn)(void *), void *getc_arg) { unsigned char sum = 0; for (int c = getc_fn(getc_arg); c != EOF; c = getc_fn(getc_arg)) { sum += c; } return sum; }
I've just described a single interface to entities of different types, therefore this solution qualifies as polymorphism.
The interface is still somewhat unwieldy because the caller
of sum_bytes()
must pass two arguments instead
of one. It is also less type-safe than the previous
interface: the caller must take care to pass a value of
getc_arg
that is compatible with the specified
getc_fn
. The compiler cannot help enforce that
because any (unqualified) pointer type implicitly converts to
void *
.
Lastly, the required type of getc_fn
isn't
compatible with the standard fgetc()
function,
which expects an argument of type FILE *
not
void *
, therefore the caller of
sum_bytes()
must use an adaptor function:
static int stream_reader_getc(void *cb_arg) { return fgetc(cb_arg); // implicit conversion to FILE * } int main(int argc, char const *argv[]) { return sum_bytes(stream_reader_getc, stdin); }
What we really want is for sum_bytes()
to
present a simple and type-safe interface which eliminates the
possibility of error.
Let's start by wrapping the callback function address and
associated data pointer into a struct
to avoid
them getting separated or mixed up with other variables:
struct reader { int (*getc_fn)(void *); void *getc_arg; };
The definition above creates a unique type, struct
reader
, which cannot be used interchangeably with any
other type (not even another struct that happens to have the
same members). In object-oriented programming, struct
reader
would be termed an 'abstract base class' and
its getc_fn
pointer a 'virtual method'.
Whereas the previous version of sum_bytes()
would accept the address of any callback function with a
compatible signature (argument and return types), the new
version only accepts the address of a struct
reader
object:
unsigned char sum_bytes(struct reader *r) { unsigned char sum = 0; for (int c = r->getc_fn(r->getc_arg); c != EOF; c = r->getc_fn(r->getc_arg)) { sum += c; } return sum; } int main(int argc, char const *argv[]) { struct reader r = {stream_reader_getc, stdin}; return sum_bytes(&r); }
To avoid mistakes when instantiating objects of type
struct reader
, let's create functions to initialize an
object for different types of data source. These would be
termed 'subclass constructors' in object-oriented
programming:
void stream_reader_init(struct reader *r, FILE *stream) { *r = (struct reader){stream_reader_getc, stream}; } void buf_reader_init(struct reader *r, int *buffer) { *r = (struct reader){buf_reader_getc, buffer}; }
This is already looking much safer, but I haven't yet
provided a definition of the buf_reader_getc()
function to read a test vector. Writing that function turns
out to be problematic:
static int buf_reader_getc(void *cb_arg) { static size_t bytes_read; int const *const buffer = cb_arg; // implicit conv. to int * return buffer[bytes_read++]; }
Whilst this might be acceptable for a throwaway test, it is
not acceptable for a real program! The static
storage class of the bytes_read
counter means
that only one buffer can be read using this function, and
that buffer can be read only once.
What we actually need would be termed an 'instance variable'
in object-oriented programming: every instance of
struct reader
initialized by
buf_reader_init()
must have its own copy of
bytes_read
which starts counting from 0. This can only
be implemented by allocating extra memory for instances of
struct reader
that read from a buffer instead of
from a file.
Now might be a good time to consider subtyping.
Subtyping is not supported by C in the sense that subtypes can be used interchangeably with their supertype, because the type system is not hierarchical.
However, something like subtyping can be implemented by
nesting a struct
representing a supertype within
a struct
representing its subtype. Here's that
approach applied to struct reader
, which has
been extended with a new subtype named struct
buf_reader
:
struct buf_reader { struct reader super; int const *buffer; size_t bytes_read; }; static int buf_reader_getc(void *cb_arg) { struct buf_reader *br = cb_arg; return br->buffer[br->bytes_read++]; } void buf_reader_init(struct buf_reader *br, int const *buffer) { *br = (struct buf_reader){ .super = {buf_reader_getc, br}, .buffer = buffer, .bytes_read = 0 }; }
The test program now declares an object of the new subtype
which contains extra space for instance variables,
buffer
and bytes_read
. As before, it
calls buf_reader_init()
to initialize the
object:
int main(int argc, char const *argv[]) { struct buf_reader br; static int const testdata[] = {1, 2, 3, 4, EOF}; buf_reader_init(&br, testdata); if (sum_bytes(&br.super) == 10) { puts("Test passed"); return EXIT_SUCCESS; } else { fputs("Test failed\n", stderr); return EXIT_FAILURE; } }
The test vector can now be qualified as const
(and perhaps stored in read-only memory) because the address
of this array is no longer converted to an unqualified
void *
pointer in the buf_reader_init()
constructor.
Note that the test must explicitly pass the address of the
embedded struct reader
when calling
sum_bytes()
, because struct reader
and
struct buf_reader
are distinct from the point of
view of the compiler. This would be termed 'upcasting' in
object-oriented programming, where it is usually implicit.
There are still a few questionable things about this design:
int
terminated by the macro value EOF
,
instead of an array of unsigned bytes, is awkward.
struct buf_reader
in
struct reader
seems redundant; both objects
have the same address. In any case, the address of either
object could be calculated from the other using
offsetof()
.
The first point can be addressed by adding a new instance
variable, buf_size
, to store the maximum number
of bytes to be returned by buf_reader_getc()
:
struct buf_reader { struct reader super; unsigned char const *buffer; size_t bytes_read, buf_size; }; static int buf_reader_getc(void *cb_arg) { struct buf_reader *br = cb_arg; if (br->bytes_read >= br->buf_size) { return EOF; } return br->buffer[br->bytes_read++]; } void buf_reader_init(struct buf_reader *br, unsigned char const *buffer, size_t buf_size) { *br = (struct buf_reader){ .super = {buf_reader_getc, br}, .buffer = buffer, .bytes_read = 0, .buf_size = buf_size, }; } int main(int argc, char const *argv[]) { struct buf_reader br; static unsigned char const testdata[] = {1, 2, 3, 4}; buf_reader_init(&br, testdata, sizeof(testdata)); if (sum_bytes(&br.super) == 10) { puts("Test passed"); return EXIT_SUCCESS; } else { fputs("Test failed\n", stderr); return EXIT_FAILURE; } }
Addressing the second point requires an interface change:
instead of passing a void *
pointer when calling
the virtual method to get a byte, sum_bytes()
could pass the address of the struct reader
instance. This pointer to the current object would be termed
'self' or 'this' in object-oriented programming.
Here are are the modified struct reader
and
sum_bytes()
definitions:
struct reader { int (*getc_fn)(struct reader *); }; unsigned char sum_bytes(struct reader *r) { unsigned char sum = 0; for (int c = r->getc_fn(r); c != EOF; c = r->getc_fn(r)) { sum += c; } return sum; }
Deleting the cb_arg
member of struct
reader
means that the FILE *
from which
fgetc()
reads data must instead be stored in an
instance variable. That entails defining a second subtype,
which I have named stream_reader
:
#include <stddef.h> #define container_of(addr, type, member) \ ((type *)(((char *)(addr)) - offsetof(type, member))) struct stream_reader { struct reader super; FILE *stream; }; static int stream_reader_getc(struct reader *r) { struct stream_reader *sr = container_of(r, struct stream_reader, super); return fgetc(sr->stream); } void stream_reader_init(struct stream_reader *sr, FILE *stream) { *sr = (struct stream_reader){ .super = {stream_reader_getc}, .stream = stream }; }
Updating the functions associated with the first subtype
(buf_reader_init()
and
buf_reader_getc()
) to conform to this new interface is
left as an exercise for the reader.
I used a de facto standard macro,
container_of()
, to calculate the address of a
struct stream_reader
from the address of its
struct reader
member. This would be termed
'downcasting' in object-oriented programming. It is
inherently unsafe because the compiler cannot be sure that
the given object is actually an instance of the specified
subtype.
I wrote earlier that "both objects have the same address", so
you might wonder why I used container_of()
instead of simply casting to the required subtype
struct stream_reader *
.
In part, this choice comes down to whether or not you
subscribe to the notion of inheritance. If you believe that
stream_reader
is a reader
(inheritance), then a simple cast makes perfect sense; if you
instead believe that stream_reader
has
a reader
(composition), then
container_of()
makes more sense.
Using container_of()
does have some practical
advantages:
struct stream_reader
has a member named super
; cleverer definitions
of container_of()
also force the compiler to
check that the type of super
is struct
reader
.
super
member is accidentally moved from
its position at the start of the definition of struct
stream_reader
then the program is still likely to
work.
struct stream_reader
to implement more than one
interface ('multiple inheritance').
However, container_of()
also has a big
disadvantage, if you care about writing strictly-conformant
code:
When an expression that has integer type is added to or subtracted from a pointer... If both the pointer operand and the result point to elements of the same array object, or one past the last element of the array object, the evaluation shall not produce an overflow; otherwise, the behavior is undefined.
(6.5.6 'Additive operators', ISO/IEC 9899:1999 'Programming languages — C')
The char *
cast in the macro definition allows
the designated member to be reinterpreted as an array, but it
is not an array object, and the result of subtracting
offsetof(type, member)
may point outside it. In
practice, the given definition works for all known compilers
and target architectures.
Most programmers simply ignore the fact that
container_of()
may have undefined behaviour. It's up
to you whether you want to follow them. You can instead
continue to store a void *
pointer in each
instance of a supertype, or (if you don‘t need multiple
inheritance) ensure that super
is always the
first member of a subtype struct
. In the latter
case, container_of()
has defined behaviour
equivalent to a simple cast.
In most circumstances, I'd be happy with the solution that I
have presented above. It's simple, efficient, and type-safe
(except for my use of container_of()
in the
virtual method definitions).
When working on a larger project, or when designing
interfaces for which it's necessary to maintain binary
compatibility, you might want to hide some implementation
details. In C programs, the way to do so is using incomplete
struct
types. These can be used in any situation
where the size of a struct
is not needed and its
members are not accessed directly.
Here are a few encapsulation failures to consider remedying:
sum_bytes()
function requires struct
reader
to be a complete type in order to call the
virtual method that it uses to get data. This exposes any
instance variables belonging to the supertype.
struct
stream_reader
to be a complete type in order to
declare an object of that type, and to allow it to obtain
the address of the embedded struct reader
.
This exposes any instance variables belonging to both
supertype and subtype.
struct buf_reader
to be a
complete type in order to declare an object of that type,
and to allow it to obtain the address of the embedded
struct reader
.
The first point can be addressed by creating a header file
which declares an incomplete struct reader
type
and wrapper functions to call any virtual methods that it
provides (in this case, only one).
The complete definition of struct reader
needed
to implement subtypes must be put in a separate private
header file. Provided that the private header is only
included in source files which need the complete type, the
struct
members will be hidden.
struct reader; int reader_getc(struct reader *);
struct reader { int (*getc_fn)(struct reader *); };
#include "reader.h" #include "reader_struct.h" int reader_getc(struct reader *r) { return r->getc_fn(r); }
The second point can be addressed by writing an alternative
constructor for struct stream_reader
that
dynamically allocates memory for new object instances, and a
function to obtain the address of the struct
reader
embedded in an instance of struct
stream_reader
.
Once again, the complete definition of struct
stream_reader
needed to implement the subtype, or
allocate instances of it, must be put in a private header
file.
#include <stdio.h> #include "reader.h" struct stream_reader; void stream_reader_init(struct stream_reader *sr, FILE *stream); struct stream_reader *stream_reader_new(FILE *stream); struct reader *stream_reader_get_reader(struct stream_reader *);
#include <stdio.h> #include "reader_struct.h" struct stream_reader { struct reader super; FILE *stream; };
#include "stream_reader.h" #include "stream_reader_struct.h" static int stream_reader_getc(struct reader *r) { struct stream_reader *sr = container_of(r, struct stream_reader, super); return fgetc(sr->stream); } void stream_reader_init(struct stream_reader *sr, FILE *stream) { *sr = (struct stream_reader){ .super = {stream_reader_getc}, .stream = stream }; } struct stream_reader *stream_reader_new(FILE *stream) { struct stream_reader *sr = malloc(sizeof(*sr)); if (sr) { stream_reader_init(sr, stream); } return sr; } struct reader *stream_reader_get_reader(struct stream_reader *sr) { return &sr->super; }
Defining separate public and private interfaces for the other subtype is left as an exercise for the reader.
I hope you found this tutorial useful; I had fun writing it. If not, why not comment and let me know what other C programming topics you would like to read about?
Thanks for reading.