C is a great programming language if you want to work closer
with the machine. A couple of decades ago, C was considered a high-level
programming language; look at how times have changed. We’ve been babied with
programming languages like Java, C#, JavaScript, PHP, Python, you name it. You
can pull a car using the frame machine, but we’re going to do it with a hammer
and a chain. Why? Just to prove that we can. There are other reasons for
studying C, the primary one being the speed of the program. Understanding how
everything works at a lower level makes learning other programming languages a
breeze. So, let’s get into it. The code in this post is limited to short
snippets. You can watch some YouTube tutorials for more.
C is a compiled language, meaning that you need a C compiler
to convert your readable code into machine code. Machine code is just a bunch
of zeros and ones.
C got its popularity mainly because Unix was written in C.
Today, most operating systems are written in C, as are many graphics-intensive
video games (often together with C++).
Since C is a small language, most of the functions that you’ll use are
provided by libraries. Their declarations live in header files that you include
at the top of the source file. The preprocessor copies the contents of the
included file and pastes it into the source code where the #include directive
was entered. The most common one is the stdio.h header file, which stands for
standard input/output. This header declares the functions that you need to read
data from and write data to the terminal. A couple of functions that you can’t
live without that are declared in the stdio.h file are scanf and printf.
A few things to clear up before we continue. When compiling
your code using, for example, gcc, you may have seen something like the
following written: gcc somefile.c -o somefile. You’re using the gcc utility to
create an executable file called somefile. You can enter gcc somefile.c and
it’ll still work. However, this time it’ll create an a.out file. To run it, you
would have to enter ./a.out. If you do include the -o somefile, you can run it
by entering ./somefile into your terminal. Also, the order matters slightly.
You have to enter the name of the executable right after the -o option. You
can, however, move the -o to the beginning, such as gcc -o somenewfile
somefile.c. Also, the names of the executable and the source file don’t have to
match. The -o means output; so, it specifies the output file.
The -o stated previously is different from the capital O as in -O, -O2, -O3
and -Ofast. The capital O is set when you want gcc to optimize your code. -O3
includes all of the optimizations of -O and -O2. -Ofast goes one step further:
it turns on everything in -O3 plus optimizations that aren’t valid for all
standard-compliant programs (such as -ffast-math). Higher levels generally take
longer to compile. Gcc optimization is switched off by default (-O0) to keep
compilation fast and to make debugging produce the expected results.
If you’re writing a C program and you want it to accept command line options
(like the -o option), you can read them with the getopt() function. First,
include the unistd.h header file; unistd.h is not part of the standard C
library, but is instead part of POSIX. To read each of the command line
options, place the getopt() function into a while loop, i.e.
while ((ch = getopt(argc, argv, "do:")) != -1) {…}. The last argument provided
to the getopt() function states that both d and o are command line options;
the colon after the o means that the o option also needs an argument, which is
written right after the -o option. To get that argument, you’ll read the optarg
variable once the “o” command line option is matched. Your command line options
can also be combined as long as the option that requires an argument is written
last (i.e. -do something). If you want to pass both command line options and
negative numbers, you can end the option list with -- (i.e. gcc_custom -do
somefile -- -5 somefile.txt).
You’re probably wondering why we must use the period
forward-slash before the file name. If you were to enter somefile without the
./, the operating system would try to find the file in the directories listed
in its PATH environment variable. If it’s not there, it’ll let you know that
the command can’t be found. To get around that, you’re telling the terminal to
look in the current directory to execute the file: ./ specifies the current
directory. If you really want to eliminate the ./, you can append the absolute
path of the folder that contains your executable to your PATH environment
variable.
One last thing, if you’re using gcc on a Linux distribution,
the compiler will have created a file named somefile after executing the
command gcc somefile.c -o somefile. On Windows, the name of the file will have
a .exe appended to it.
Strings
C doesn’t have a dedicated string type out of the box. Strings are stored
in an array of characters. When printing out characters to a screen, the printf
function looks for the null character ‘\0’ to know when to terminate a string.
Each character is stored in 1 byte of memory. If an array of characters has 10
elements, it would occupy 10 bytes of sequential memory. If the printf
function doesn’t encounter a null character, it will continue into the next
memory cell, which would not be beneficial for us since we have absolutely no
idea what’s stored there. So, when creating a string, make sure that the
array’s length is the length of the string plus one.
This also brings up another frequently asked question: why
do array indices start at zero and not at one? The array variable only knows
the memory address of the first element. It doesn’t store the memory address of
the next element. What we do know is that arrays are stored sequentially. So,
if we have a character array, and we know that each character takes up only one
byte of memory, then we can say that the next character is 1 byte away from the
starting point. The first element is similarly 0 bytes away from the starting
point. Each index is an offset, which literally translates to the distance from
the starting point, measured in elements.
You can create a string several ways, but here are two in C: by using a
string literal or by declaring a character array and populating it manually. A
string literal itself is a constant; it’s typically stored in the
read-only-data segment, so if you point a char pointer at it (i.e.
char *s = "text"), you must not change it through that pointer. If you
initialize a character array with a string literal (i.e. char s[] = "text"),
the characters are copied into the array and each element is mutable. A
character array declared inside a function is allocated on the stack. You can
also store strings, using the malloc function, on the heap. In other
programming languages, like Java, the new operator is used to allocate space on
the heap.
What else can you do with strings in C? You need to check
the string.h header file for useful function declarations. As a side note, a
.h file is a header file that contains declarations. Often a library is
distributed in compiled form, so you can’t view the implementations of the
functions, but you’ll be provided with the .h file so that you may review the
declarations and useful notes on how the functions work. If you’re using a
Linux operating system, you can learn more about a function by typing man
function_name into the terminal (i.e. man strcmp). The man Linux utility stands
for manual.
The string.h file is part of C’s standard library. What’s
the standard library? It’s just a collection of code that came pre-installed
with the compiler that you downloaded (or had with a Linux distribution).
Another extremely useful header file is the stdio.h that you use for your
input/output (i.e. printf and scanf). C is a very lightweight language so it
relies heavily on the standard library.
Conditional Expressions
C follows the short-circuit evaluation technique to speed up
the program. That means that if there are multiple comparisons joined with the
AND operator (&&), as soon as one fails we know that there’s absolutely no way
that the overall expression can be true (thanks discrete math), so the rest is
skipped. Sometimes this can be a problem if you’re updating a variable in the
second expression with a prefix or postfix increment operator, because that
update may never run. Similarly, we know that an OR (||) expression is true
when one or both operands evaluate to true. If the first expression evaluates
to true, there’s no point evaluating the second one. C originally didn’t have a
Boolean type. C99 added the _Bool type, and the stdbool.h header lets you write
bool, true and false, but in the end, true and false are just 1 and 0
respectively. This can cause unexpected results if you use the assignment
operator (=) instead of the equality operator (==) in your conditional
expression, since in C a zero represents false and every non-zero value
represents true.
What’s the difference between && and &? Similarly, what’s the difference
between || and |? The bitwise & (AND) and bitwise | (OR) operate on the
individual bits of their integer operands. They always evaluate both sides, so
there’s no short-circuiting, but the result isn’t limited to 0 or 1 (i.e.
1 & 2 is 0 even though 1 && 2 is 1).
Loops
There are two general types of loops: pretest and posttest
loops. In pretest loops, the control statement is evaluated prior to the
statements in the loop body. In a posttest loop, the statements in the body are
evaluated first followed by the loop condition. When looking at the operational
semantics of counting loops, both “while” and “for” loops, you’ll quickly
see that they’re very similar: initialize a loop variable, test against
terminal value, evaluate statements in loop body and increment loop variable.
Each expression in C’s for loop can contain multiple expressions separated by
commas. C allows for the use of the break statement, which terminates the loop,
as well as the continue statement, which skips the remaining statements in the
loop body and moves on to the next iteration (in a for loop, the third
expression is still evaluated before the condition is tested again).
If loops are still hard to visualize, just take a look at
the operational semantics of each one. Let’s start off with the for loop:
for (expression_1; expression_2; expression_3)
    loop_body
Looking at the operational semantics of a for loop, you can
quickly see that expression_1 is evaluated first.
expression_1
loop:
if expression_2 = 0 goto out
[loop body]
expression_3
goto loop
out:
This is the initialization step. The loop label comes after
the initialization of the loop variable normally. After the label, the
condition is evaluated. If the condition is false, the unconditional branch
(goto) transfers the control to the “out” label location in the program. If the
condition is true, the statements contained in the loop body are executed.
After the execution of the loop body, expression_3 is evaluated. In the for
loop, expression_3 normally serves as the step size. After the execution of the
third statement, the goto statement transfers the control to the “loop” label
location in the program which if you remember comes after the execution of the
first expression.
In C’s for loop, each of the expressions is optional; the
semi-colons are not optional. Missing a second expression is the same as having
an expression that’s always true; this can potentially cause an infinite loop
unless you have an explicit branch (such as break) in your loop body. The first
and third expressions can be a series of expressions separated by a comma. The
second expression can be a multi-conditional expression. The loop body in C’s
for loop is also optional. If no statements are provided, a semi-colon must be
included after the closing parenthesis. Since numerous expressions can be
evaluated in the for loop’s control statement, it’s common to see for loops
without a loop body.
Counter-controlled loops were created for convenience,
since so many logically controlled loops had some sort of counting variable.
Every counting loop can be built with a logical loop; the reverse isn’t true.
The two most common logically controlled loops are the while and do-while
loops. The difference between the two is that the while loop is a pretest loop,
but the do-while loop is a posttest loop.
Like before, let’s examine the operational semantics of
both. The general form of the while logical loop is:
while (control_expression)
loop_body
The operational semantics for the while loop looks like the following:
loop:
if control_expression is false goto out
[loop body]
goto loop
out:
In the pretest loop above, the condition is evaluated first.
If false, the goto statement transfers the control to the “out” label,
terminating the repetition. If the condition evaluates to true, the statements
in the loop body are executed and the unconditional branch redirects the
execution of the program to the loop label.
Now, let’s examine the general form and operational
semantics of a do-while posttest logical loop.
do
loop_body
while (control_expression);
The operational semantics are listed as follows:
loop:
[loop body]
if control_expression is true goto loop
Examining the operational semantics of the do-while loop we
notice that the statements contained within the loop body are executed at least
once and are performed before the condition is evaluated. If the control
expression is evaluated to true, the goto branches to the loop label.
Functions
You must specify a return type for each function. If the
function is not returning anything, void is used as the return type in the
function declaration. Arguments in C are always passed by value: the function
receives a copy. To get the effect of “pass by reference,” the programmer
passes a pointer to the data, and the function works on the original through
that pointer.
When returning a pointer, make sure that it doesn’t point to a variable
that was declared inside the function. Local variables are placed on the stack;
the lifetime of a local variable ends when the function returns, so the
returned pointer would point to memory that’s no longer valid. Something else
to be cautious of is passing arrays as parameters. Calling the sizeof operator
on the array prior to the function call will provide you with the correct size
of the array in bytes; however, inside the function the parameter is just a
pointer, so the sizeof operator will give you the size of the pointer variable,
not the array. Make sure to pass another argument to the function that contains
the size of the array if you need that information. If you’re passing a pointer
argument to a function and you don’t want the data that it points to to be
accidentally modified, include the keyword const before the pointer’s data type
(i.e. const int *num).
Once in a blue moon you’ll write some code that’s mutually
recursive (i.e. function one calls function two and function two calls function
one). In this case, there’s really no way to arrange the functions so that the
C compiler will be happy; you have to declare the functions prior to calling
those functions. Even better than placing the declarations (called prototypes
in C) in the same source file would be to place them in a header file. Function
declarations are necessary since the compiler reads the file from top to bottom
and has to know a function’s types before the call; they’re needed for static
type checking.
When including your custom header file make sure to wrap its name
in double quotes to tell the compiler that it’s a local file (searched for
relative to the current source file first) and not in a directory where library
headers are located. You can place the full pathname in your include statement
if you’re including a header file with double quotes. When the preprocessor
runs, the header file code will be “copied” to the point where the “#include”
is specified. The compiler doesn’t actually create a new file; instead it
“pipes” the information through the compilation process. If you’re including a
header file whose definitions are located in another source file, you’ll have
to specify both source files when compiling (i.e. gcc file_a.c file_b.c -o
file_a). In older versions of C (before C99), if a function returned an int,
the code would still compile even if you didn’t declare the function before
using it. Why? When the compiler got to that portion of the code, it assumed
that the function returned an int. C99 removed this implicit declaration rule;
modern compilers warn about it or reject the code, so always declare your
functions first.
C supports variadic functions, which are functions that
accept a variable number of arguments. To create your own variadic function,
you’ll first have to include the stdarg.h header. When defining a function,
you’ll have to specify that the function will be a variadic function by
including the “...” ellipsis after the fixed parameters of the function. Within
the function, you’ll need to create a va_list (variable argument list) that
will give you access to the extra arguments that are passed to the function.
After you create the va_list, you’ll also have to specify the last fixed
argument with the va_start macro; va_start accepts two parameters: the va_list
and the last fixed parameter of the function. To finally read the arguments,
you’ll use the va_arg macro. va_arg accepts two parameters: the va_list and the
type of the argument to read next. Once you’re finished reading the list of
arguments, you’ll need to tell C that you’re done with the va_end macro; va_end
accepts one parameter: the va_list. To create a variadic function, you’ll need
to have at least one fixed parameter.
A function name, when used in an expression, is converted to a pointer to
the function; that pointer holds the address of the function. If you have a
function drive(), then drive and &drive are both pointers to the function. The
function name itself is a constant; you can’t make it refer to a different
function. To create a pointer variable that points to a function you’ll have to
specify the return type of the function, the name of the pointer variable
wrapped in parentheses (with a * in front of it) and the parameter types that
the function that you’re pointing to has (i.e. char **(*var_name)(int, char *)).
This is normally done when you’re passing a function as an argument to another
function or if you’re creating an array of function pointers. Certain
object-oriented languages that are built with C utilize function pointers to
create many object-oriented features.
Pointers
What is a pointer? A pointer is a variable that stores a memory address
as its value. We can use that memory address to find our way to the particular
area in memory. If you remember earlier, I mentioned that parameters to a
function are passed by value. If you pass an extremely large amount of data by
value, it means that the function must make a copy of that data and store it
locally. Local variables (variables declared within the function) are stored on
the stack. If the value that you just copied is too large, it can cause the
stack to run out of memory. Also, copying such large objects (not to be
confused with objects in object-oriented programming) is time consuming. It’s
much cheaper to just pass the address of where the object resides.
As a side note, why do functions store their variables in a
different section of memory? One reason is scope of variables for recursive
functions.
Imagine the following piece of code being evaluated:
int i = 0, j = 1;
i = 2;
j = i;
In the example above, if the variable i is located on the left-hand side of the
expression, the value replaces the contents of i. If i is located on the
right-hand side, the value of i is assigned to j. Make sure to understand that
basic concept. Once you do, dereferencing a pointer on the left-hand side
causes the value of the memory location that the pointer is pointing to, to
change. Dereferencing a pointer on the right-hand side of the expression causes
the value to be retrieved from the memory location that the pointer points to. So,
in other words, the * operator can read the contents of a memory address or set
the contents of a memory address that the pointer is pointing to.
To assign a memory address of a scalar to a pointer
variable, you must use the & operator to get the memory address of the
scalar. You also have to make sure that both the pointer and the scalar are of
the same data type. Why do pointers have types? In a couple of paragraphs, I’ll
describe pointer arithmetic. But generally, if you were to add 1 to a pointer
to a char, or 1 to a pointer to an int, the arithmetic needs to be different
since a char occupies 1 byte and an int typically occupies 4 bytes of memory.
If an array stores integers as values and you want to go from array[0] to
array[1], you need to move sizeof(int) bytes (typically 4) away from array[0].
Arrays can be used as pointers. In most expressions, the array name
evaluates to the memory address of the first element of the array. If you print
the array name as an address and the memory address of arrayName[0], you’ll
notice that the memory addresses are identical. Array variables can’t point to
somewhere else though. Also, when using the sizeof operator on the array name,
the compiler will tell you the size of the whole array. If you instead use a
pointer that points to the first element of the array (which is what you have
after passing the array to a function), the information about the array’s
length is lost and sizeof will only give you the size of the pointer variable,
which is typically 4 bytes on 32-bit machines and 8 bytes on 64-bit machines.
The loss of information is called decay.
Since an address is a number, you can do pointer arithmetic: you can add
integers to a pointer and subtract integers from a pointer. If you create two
pointers, one for example pointing to the first element and the other pointing
to the third element in the array, you can subtract the pointers from each
other; the result is the number of elements between them (2 here), not the
number of bytes. In pointer arithmetic, you cannot add two pointers together.
Arrays
If you understand arrays in another programming language, it should be
simple to understand arrays in C as well. An array is stored in sequential
memory addresses, with element zero sitting at the memory address that the
array name refers to; subsequent elements can be accessed through offset
calculations. An array can store any data type as long as all of the elements
are of the same type. Arrays can also store other arrays; these types of arrays
are called multi-dimensional arrays. There are two types of multi-dimensional
arrays: jagged and rectangular. C’s two-dimensional arrays are always
rectangular. What does that mean? Let’s say that you wanted to store strings
(character arrays) in an array. The length of the second dimension will have to
equal the number of characters in the longest string plus one (for the null
character). When the array is initialized with shorter strings, the unused
spaces are filled with null characters. A two-dimensional array is stored
contiguously in memory, so if you have a two-dimensional array[3][3], to access
the third element of the second row you may write array[1][2]. Since the rows
sit one after another, that element is 1 * 3 + 2 = 5 elements away from the
first element; writing array[5] won’t get you there, though, because the first
index selects a whole row and the array only has three rows.
You can also create an array of pointers which is just a
list of memory addresses stored in an array. This way, you don’t have to
declare a second dimension; each pointer can be stored in a single-dimensional
array even though the values that they point to (i.e. strings) may have
different lengths. The pointers still have to be of the same type (for pointer
arithmetic).
Structs
A struct (structured data type) is like an array; array elements
are accessed via indices while struct elements are accessed via field names.
Arrays require that the data type of the elements be the same while struct
fields can have different data types. To get the total memory size of the
struct, use the sizeof operator; the result is at least the sum of the sizes of
the fields, and the compiler may add padding bytes between or after fields for
alignment. Fields are stored sequentially in memory in the order that they’ve
been declared within the struct. Once a struct is created, the length is fixed
regardless of whether all the fields are used or not; the maximum amount of
space is allocated for each field.
Adding an identifier after the keyword struct will create a
new data type that you can use to declare new variables. When declaring a
new variable with the data type of a particular struct, you have to include the
word struct prior to the struct data type name (i.e. struct vehicle lambo).
When initializing a struct variable, make sure to place the values in the order
that the fields are declared within the struct (i.e. struct vehicle lambo =
{"Murci", "mph", 220};). To access a field within a struct, you would use the
dot (.) operator (to update and read values). If you assign the struct to
another variable, a copy of the struct is made; the two variables occupy
separate memory. When dealing with complex structures, sometimes it’s necessary
to nest structs. You can access a field of the nested struct with the dot
operator again (i.e. lambo.model.color, assuming the nested struct field is
named model). The nested struct can be initialized in a similar fashion as a
single struct (i.e. struct vehicle lambo = {{"Murci", "green"}, "mph", 220};).
If the variable is a pointer to a struct then you’ll need to dereference the
variable prior to referencing the field (i.e. (*lambo).speed). The -> operator
can be used instead (i.e. lambo->speed); it combines the dereferencing of the
pointer variable and field referencing. When using the dereferencing symbol “*”
and the dot operator (.) make sure that the dereferencing is wrapped in
parentheses since the dot operator has higher precedence than the dereference
operator.
To eliminate placing the struct keyword prior to variable declaration,
you can use the typedef keyword and place the identifier, which will act as an
alias for the struct, after the closing brace. Once the data type has been
assigned an alias, you may use only the type name in front of the variable name
(i.e. vehicle a = {…};).
Unions
Unions are used when a variable may contain different data
types throughout its lifetime. A struct can be used, however, due to how
structs are implemented, memory space will be wasted. When declaring a union,
the compiler will allocate enough space for the largest field within it (i.e.
if a union contains a char and a double, it will allocate enough space for the
double). Regardless of how many fields are defined within a union, each value
is assigned to the same memory address, so only one field holds a meaningful
value at a time.
A union looks like a struct other than the keyword union
being used. Typedef can be used in unions as well to create an alias for the
data type. You can use the designated initializer to initialize a union by
field name (i.e. height x = {.euroStd = 1.1};). You can also set the value with
the dot notation after the variable has been declared (i.e. height x;
x.euroStd = 1.1;). You don’t have to initialize a field by name; without a
designator, the value in the braces initializes the first field (i.e. height x
= {1.1};). Unions can be declared within structs to have a field that can
accept different data types and potentially save memory space. You can access
union fields with the dot “.” or “->” operators.
For both structs and unions, when an identifier is placed
after the closing brace without using the typedef, that identifier becomes a
variable of the struct or union data type.
Other things to think about
On larger projects, you don’t want to recompile all the code
each time you make a change. First, make sure that you have object files of
everything using the command gcc -c *.c. The -c tells gcc that it should create
the object files but not link them. After the object files are created, you
need to run gcc -o file *.o, which will link all of the generated .o files.
Since the files are already compiled, gcc skips the compilation steps and goes
straight to linking them together to form an executable. If changes are made
to a single file, you’ll only have to recompile that one file using the -c
option outlined above (of course specifying your file name instead of using the
* symbol). You will have to link all of the files again to create the
executable but it’s a drastic reduction in compilation time. You can automate
this process using the “make” build automation tool.
When you need to allocate memory at runtime, you’ll use the
malloc function. Malloc takes a single parameter; that parameter tells the
malloc function how many bytes to allocate on the heap. Since most of the time
you’re not going to know offhand how many bytes a data type needs, the malloc
argument almost always utilizes the sizeof operator. To be able to use the
malloc function, you’ll first need to include the stdlib.h header. Once the
memory has been allocated on the heap, the malloc function returns a
general-purpose pointer (void*) to the newly allocated space, or NULL if the
allocation failed. In C the general-purpose pointer converts to any other
object pointer type automatically, so a cast isn’t necessary, although many
programmers add one anyway. Programmers should always use the free function to
deallocate memory on the heap once it’s no longer needed. If not, a memory leak
occurs. If a memory leak does occur, you can use a Linux utility like valgrind
to locate it. Valgrind has its own version of the malloc and free functions; it
will intercept your code and keep track of the code that calls for heap
allocation and deallocation. Valgrind works best if your compiled executable
contains debug information (to add debug information to your code, use the -g
option with gcc).
Fun fact, if you look up the definition of heap, it’s “an
untidy collection of things piled up haphazardly.” The heap in memory is called
the heap because it stores data in an unorganized way.