The C Programming Language

(Updated 2018)

Read Kernighan and Ritchie’s The C Programming Language.

Effective C

Effective C is a pretty good book.

In this chapter, you’ll learn about objects, functions, and types. We’ll examine how to declare variables (objects with identifiers) and functions, take the addresses of objects, and dereference those object pointers. You’ve already seen some types that are available to C programmers. The first thing you’ll learn in this chapter is one of the last things that I learned: every type in C is either an object type or a function type.

[…] An object is storage in which you can represent values. To be precise, an object is defined by the C Standard (ISO/IEC 9899:2018) as a “region of data storage in the execution environment, the contents of which can represent values,” with the added note, “when referenced, an object can be interpreted as having a particular type.” A variable is an example of an object.

Variables have a declared type that tells you the kind of object its value represents. For example, an object with type int contains an integer value. The type is important because the collection of bits that represents one type of object will likely have a different value if interpreted as a different type of object. For example, the number 1 is represented in IEEE 754 (the IEEE Standard for Floating-Point Arithmetic) by the bit pattern 0x3f800000 (IEEE 754–2008). But if you were to interpret this same bit pattern as an integer, you’d get the value 1,065,353,216 instead of 1.

Functions are not objects but do have types. A function type is characterized by both its return type and the number and types of its parameters.

The C language also has pointers. A pointer can be thought of as an address: a location in memory where an object or function is stored. A pointer type is derived from a function or object type called the referenced type. A pointer type derived from the referenced type T is called a pointer to T.

Because objects and functions are different things, object pointers and function pointers are also different things, and should not be used interchangeably.

If objects are just regions of memory, but ones that must be treated as the correct type, where does a C program store that type information? The compiled program does not store type information; the compiler uses type designations in the source code to understand how to correctly generate instructions to manipulate those regions of memory, but compiled machine code — the artifact the compiler spits out — doesn’t include explicit type information. (See this Stack Exchange answer.)

Objects, functions, macros, and other C language identifiers have scope that delimits the contiguous region where they can be accessed. C has four types of scope: file, block, function prototype, and function.

[…]

Objects have a storage duration that determines their lifetime. Altogether, four storage durations are available: automatic, static, thread, and allocated. You’ve already seen that objects of automatic storage duration are declared within a block or as a function parameter. The lifetime of these objects begins when the block in which they’re declared begins execution, and ends when execution of the block ends. If the block is entered recursively, a new object is created each time, each with its own storage.

Scope and lifetime are entirely different concepts. Scope applies to identifiers, whereas lifetime applies to objects. The scope of an identifier is the code region where the object denoted by the identifier can be accessed by its name. The lifetime of an object is the time period for which the object exists.

Objects declared in file scope have static storage duration. The lifetime of these objects is the entire execution of the program, and their stored value is initialized prior to program startup. You can also declare a variable within a block scope to have static storage duration by using the storage-class specifier static, as shown in the counting example in Listing 2-6. These objects persist after the function has exited.

Hello, World

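#include <stdio.h>
#include <string.h>

char s[6];	/* File scope: room for "hello" plus the terminating '\0'. */

void
hello(void)
{
	extern char s[];	/* Optional here; refers to the file-scope s above. */
	printf("%s\n", s);
}

int
main(void)
{
	printf("Hello, world!\n");

	extern char s[];
	strcpy(s, "hello");
	hello();

	return 0;
}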

Compile:

$ gcc -Wall hello.c -o hello
$ ls
hello*  hello.c
$ ./hello 
Hello, world!
hello

A very simple Makefile:

hello:
	gcc -std=c99 -Wall -o hello hello.c

clean:
	rm hello

(See my notes on make.)

Basic Syntax

#include <stdio.h>
#include <stdlib.h>

/* This is a comment.  */

#define MYCONSTANT 42    /* Note: no semi-colon. */

int myglobal;    /* External/global variable available to all functions. */

int main(void) {
	int a = '2';    /* A character constant is an int: the code for '2' (50 in ASCII). */
	int c;          /* An int, not a char, so that it can hold EOF. */
	char yesOrNo[4];
	if (scanf("%3s", yesOrNo) != 1) {    /* Read at most 3 characters plus '\0'. */
		return EXIT_FAILURE;
	}
	char mystring[] = "All the king's horses...";
	int myarray[] = {1, 2, 3};

	if (a >= 1) {
		puts("The number is at least 1.");
	} else if (a == 0) {
		puts("Zero");
	} else {
		printf("The number is %i.\n", a);
	}

	int i = 0;
	do {
		switch(i) {
		case 0:
			puts("0");
			break;
		case 1:
			puts("1");
			break;
		case 9:
			puts("9");
			break;
		default:
			puts("Not 0, 1, or 9");
		}
		i++;
	} while (i != 10);    /* Note: do-while ends with a semi-colon. */

	while (i >= 0) {
		puts("i is not negative");
		i--;
	}

	while ((c = getchar()) != EOF)
		putchar(c);

	return 0;
}

The curly braces are optional for if blocks that execute a single statement.

Variables & Memory

Think about C as memory with some syntax sprinkled on top. A C variable is:

Writing int foo = 314; tells the compiler to pick a starting memory address, set aside enough subsequent bytes to hold an integer, and store the value 314 there.

Arrays work the same way, though the compiler sets aside enough consecutive bytes to store all the values (i.e., an area of memory in bytes equal to size_of_data_type times number_of_elements_in_array).

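Structs are similar: the members are stored in declaration order inside one block of memory, although the compiler may insert padding bytes between members (and after the last one) for alignment, so the members are not necessarily packed back to back.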

Pointers

A pointer is a variable. Its value is a memory address.

The * (splat/star/asterisk) is used when declaring and dereferencing (getting the pointed-at value of) a pointer. It’s called the indirection operator, but might be read as value at address.

The & (ampersand) is used to get the memory address of a variable.

Pointers have a type (int *x;). This is so that the compiler knows where the value pointed at ends, and enables pointer arithmetic (e.g. for x + 1 the compiler knows how far to move the pointer — the width of a char, the width of an int, etc. — based on the type of x).

Why use pointers? We can pass pointers into functions, so our functions can affect more variables than just the simple return value.

K&R says:

A pointer is a variable that contains the address of a variable. […] Rather more surprising, at first sight, is the fact that a reference to a[i] can also be written as *(a+i).

The heap and the stack

C memory can be managed statically, automatically, or dynamically. The difference is how long the memory lives (its lifetime) and when it begins life (compile time versus run time). Static memory is set at compile time, along with the instruction code of the program, and is allocated for the entire runtime of the program. Automatic memory is allocated on the stack, and comes and goes as functions are called and returned. Dynamic memory management gives the programmer the most flexibility (along with more responsibility) in managing the lifetime of memory allocation. Dynamic memory is allocated from the heap during the lifetime of the process with malloc() and returned to the heap with free(). Because the memory location isn’t known until it’s allocated, the program accesses heap memory with a pointer returned by malloc(). The heap is usually much larger than the stack.

Every function has a frame (block of memory) in the stack. That is, when a function is called, all its values are pushed onto the stack in one frame/block, and that frame lives until the function returns. Stack space is managed as LIFO (last in, first out); the most recently allocated block is always the next block to be freed. All stack memory is local; when a function exits, its stack memory is freed. There is an OS-dependent limit on the size of variables which may be stored on the stack. If you try to use more memory than the stack can contain or call too many nested functions (with each one pushing an additional frame onto the stack), it causes a stack overflow, crashing the program.

The stack is faster than the heap, because the LIFO/contiguous nature of the stack has less overhead than allocating and freeing blocks of heap — it’s just a matter of moving a pointer to indicate the top of the stack. Memory allocated in the heap can be located anywhere in the heap, and it can be deallocated in any order, requiring more bookkeeping operations across potentially non-contiguous memory locations.

(The LIFO stack is how recursive and nested functions work without clobbering each other’s memory: each new call gets a new frame on top of the stack, and when it returns, its caller’s frame is back at the top of the stack.)

Each thread gets its own stack, but all the threads in a process usually share the same heap (and must coordinate safe access to that shared heap). Memory from a stack is reclaimed when a thread exits; memory from the heap, unless freed, lives for the lifetime of the process.

Variables on the heap are not so constrained in size as those on the stack. The maximum size of a stack is typically fixed when the process starts (or, for additional threads, when the thread is created), but the heap can grow during runtime. If you lose the last pointer to a block of heap memory without freeing it, you create a memory leak: the block stays allocated but unusable until the process ends, at which point most operating systems reclaim it. When there are a lot of allocations and deallocations, the heap can become fragmented and slow (it takes more operations to allocate, read, write, and deallocate non-contiguous memory scattered across many locations in the heap).

Allocating memory on the heap may be necessary if:

Memory layout of a Linux binary

A Linux ELF binary gives some clue about how memory is allocated at compile time and run time. The size command shows the amount of memory (bytes) in some of the executable’s sections. The “text” section contains the program instructions. Any running copies of the program share the instructions loaded in memory. The “data” section contains all the variables initialized with values at compile time (statics, globals). The “bss” section contains only a number: the total memory size of uninitialized values the program will require at runtime. So, the number size shows for “bss” isn’t bytes stored on disk; the value stored on disk is just the number shown (which will be the bytes of memory allocated when the executable runs). The sum of the “text” and “data” sections, on the other hand, is probably pretty near the total size of the executable file stored on disk; those sections will be read into memory when the program runs.

$  size /bin/ps
   text	   data	    bss	    dec	    hex	filename
  86386	   1544	 135016	 222946	  366e2	/bin/ps
$ ls -l /bin/ps 
-rwxr-xr-x 1 root root 93096 Sep 27 19:46 /bin/ps

readelf -a /bin/ps gives more of a peek into ELF executables.

The C Standard Library

See the documentation for your standard library (glibc) or Harbison and Steele’s C: A Reference Manual.

What does the included file actually include?

Files like stdio.h are usually found somewhere like /usr/include/.

Alternatively, do this to see what’s actually included by the preprocessor:

$ clang -E myprog.c | less

Tooling

How does compilation work? Source code becomes an executable in four steps.

  1. The preprocessor scans our source code — bringing in included files, resolving conditional compilation instructions, adding debug hints (linemarkers), and expanding macros — and outputs expanded C code.
  2. The compiler translates the expanded C code provided by the preprocessor into lower-level assembly code.
  3. The assembler takes the assembly from the compiler, adds offsets, and produces an object file.
  4. The linker takes one or more object files or libraries, and combines them to produce a single executable.

File types:

The loader runs programs. It reads the executable file from disk, checks its headers, allocates RAM (the majority as one block for the stack and heap), and copies the executable’s Text and Data sections into primary memory. The loader then copies any command line arguments onto the stack, and feeds them to main(). A runtime linker handles loading any objects from shared libraries.

How can we inspect the preprocessor output?

The preprocessor is cpp. Normally its output is piped directly to the compiler, but we can call cpp directly and save its output:

$ cpp foo.c > foo.i

or

$ gcc -E foo.c > foo.i

How can we inspect the assembly output of the compiler?

$ gcc -S foo.c

This generates ‘foo.s’.

Debugging

A debugger, such as gdb or lldb, affords somewhat more help than inspecting values by adding print statements to the code.

See my lldb debugger notes.

Valgrind

#  apt-get install valgrind

This suite of tools includes memcheck, which finds memory leaks, deallocation errors, etc. See the quick start guide and the explanation of error messages from memcheck.

$  valgrind --tool=memcheck --leak-check=yes --show-reachable=yes --num-callers=20 --track-fds=yes ./foo

Modern C

As of 2018, absent a particular reason not to, use clang for a compiler. clang defaults to a slightly extended version of C11, which is fine in most cases.

$  clang -O2 -Wall -Wextra -pedantic hello.c -o hello

During testing:

$  clang -O2 -Wall -Wextra -pedantic -Werror -Wshadow hello.c -o hello

Use Valgrind.

Further Reading