WHY C IS NOT MY FAVORITE LANGUAGE

S. A. MOORE

LANGUAGE HISTORY

The C language flowed, as the official white book lore goes, from the
language BCPL. After BCPL came the language B, followed by C; the two
names are the first and second letters of BCPL. However, the language
also shows strong influences from what was going on in FORTRAN, as will
be explained.

LANGUAGE PRINCIPLES

The idea of the C series languages is simple. If you restrict the data
types available in the language strictly to types that fit into a
single machine word, the implementation effort goes down dramatically.
At first this seems like a strange idea; what about records, arrays
and other complex types ? Well, even in languages that treat these
as fundamental types, the actual IMPLEMENTATION on the target machine
will translate references to these objects into machine addresses
or "pointers". With C, you get a rich set of pointer operations,
and you essentially "roll your own" complex type accesses.
Now it is possible to take this to different extremes. Take a hypothetical
language:

stringa[100];
stringb[100];
stridx[4];

for (stridx = 0; *(stringb+*stridx); stridx++)
    stringa+*stridx = *(stringb+*stridx);

Where:

label[x] - Allocates x bytes of storage and sets label to the constant
address.
a = b - Places a single machine word from the expression b to the location
given by address a.
*a - Gets the word at the address a.
a++ - Increments the word at the address a.

This language has the attraction that no complex typing is needed at all. The
only thing the compiler does is allocate space and equate constant symbols
to the start of that space, much as an assembler would. Of course, the notation
for a single machine word:

stridx[4]

Is rather machine dependent (it assumes 4 bytes make a word), but that can be
solved with a macro constant:

stridx[WORD];

The two problems with this language are that it does nothing to help with the
creation of complex data types, and that it is incredibly pedantic about
forming references to objects (is it the address itself, the contents, what
it points to ?).

KUDOS FOR C

C strikes a good compromise for machine oriented language design. The
compiler understands complex types and how to define them. It uses a novel
scheme to unify pointers and arrays, and uses the pointer principle to
eliminate the difference between arguments passed by reference and arguments
passed by value. The modularity of C is excellent, and the calling conventions
are so flexible that it was, and is, possible to rewrite a given compiler's
low level support functions.
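
As a minimal sketch of that unification (the function below is invented for
illustration): an array argument arrives as a pointer, the subscript and
pointer forms are interchangeable, and the caller's data can be changed
through it, which is C's stand-in for pass by reference.

/* illustrative only: an array argument is really a pointer argument */
void scale(int *a, int n, int factor)
{
    int i;

    for (i = 0; i < n; i++)
        a[i] = a[i] * factor;   /* a[i] is just shorthand for *(a + i) */
}
/* the caller passes an array name, which quietly becomes a pointer:
   int data[10]; scale(data, 10, 2); modifies the caller's array */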

THE PROBLEMS

LACK OF REFERENCE CHECKING

Number one on my list of problems is C's lack of access bounds checking.
From FORTRAN on, most languages were founded on the idea that storage was
both defined and protected by the language. Indeed, one of the biggest
advantages of programming in a high level language vs. assembly is the ability
to control reference problems. With the inclusion of unbounded pointer
accesses, C imports the worst problem of assembly language.
After two decades of C's ascendance to the most common computer language on
Earth, I truly believe that the use of C has been a major contributor
to unstable programming. Most software on the market crashes on uncommon
tasks, or even just because its storage or that of its operating system
becomes full or fragmented. The use of memory management hardware to contain
access problems has only changed this situation from a wholesale system
lockup to an error message (that is meaningless to the user) and lost files.
A new language, Java, was created mostly to cover the fact that C's references
cannot be contained.
C encourages lost references in a number of ways. Most programmers walk
pointers through arrays, which makes it impossible for the compiler to verify
where you are in the array. Any item in C can be addressed by a pointer made
up on the spot. This includes locals, allowing pointers to be made to
variables that may no longer even exist !
Today the most common error with C is something like "segment violation",
which means one of your pointers went wild. It is the most meaningless of
error messages: you get perhaps an address (in machine language terms) where
the fault occurred and, if you are lucky, a prayer. To paraphrase another
author, "this botch, if corrected, would make the language an order of
magnitude more useful".
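
As a minimal sketch (the buffer, function and values below are invented for
illustration) of what a C compiler will accept without a murmur:

#include <stdio.h>

/* illustrative only: both of these compile cleanly, and both are wrong */
int *address_of_local(void)
{
    int local = 42;

    return &local;              /* a pointer to a variable that ceases to
                                   exist the moment the function returns */
}

int main(void)
{
    char buf[8];
    char *p = buf;
    int i;

    for (i = 0; i < 100; i++)   /* nothing stops the walk at buf[7] */
        *p++ = 'x';
    printf("%d\n", *address_of_local());    /* reads a dead local */
    return 0;
}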

LACK OF TYPE CHECKING

Conversion of types is the standard pastime of C programmers. Since every
object the language manipulates is by definition a word (except, strangely,
floating point, which appears to live by a separate set of rules), they should
all be freely convertible. The conversion of types was left pretty much to the
programmer in the first version of C. The result was pointers left in integer
variables and other horrors. The "fix" for this is hardly better. Now anything
is ok, as long as you announce you are doing it with a cast. Despite what
C programmers have told themselves, the new way still does not allow the
compiler to enforce typing rules. Pointers are commonly passed as void
pointers, and what a pointer actually points to is between the programmer
and God.
I have lost track of the number of times I have been told that only bad
programmers work this way. Yet the APIs for Windows, OS/2 and other operating
systems feature call parameters that are routinely mangled (not just in the
C++ sense). A single passed parameter may contain a pointer to a structure;
or be zero (NULL), which has its own meaning to the callee; or be a small
number, relying on the fact that pointers are generally large numbers.
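
A minimal sketch of the style of call being described; the message constants
and the send_message function here are hypothetical stand-ins for the real
APIs:

#include <stddef.h>

/* hypothetical message names and call, standing in for the real system APIs */
enum { MSG_SETRECT, MSG_CLEAR, MSG_SETMODE };
struct rect { int x, y, w, h; };

static long send_message(int msg, void *param)
{
    (void) msg; (void) param;   /* stub: the real call lives in the OS */
    return 0;
}

void example(struct rect *r)
{
    send_message(MSG_SETRECT, r);           /* a pointer to a structure */
    send_message(MSG_CLEAR, NULL);          /* zero (NULL), meaningful to the callee */
    send_message(MSG_SETMODE, (void *) 3);  /* a small number posing as a pointer */
}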

MACROS

Nowhere are the reasons for a language feature more obscure than C's reliance
on a macro preprocessor. The only real need for this hidden, extra pass is to
define constants, a feature that could easily have been included in the
language. I have always harbored a suspicion that macros hurt more than they
help, and the C language turned this into a conviction. Macros shred the
logical structure of the program. Where macros are extensively used, it
becomes a challenge to figure out exactly what the actual code looks like.
Errors become mysterious, and the line numbers become meaningless (some
compilers attempt to minimize this problem). The reason macros are a bad
idea is that they have no intelligence associated with them. They cut and slice
program code with total disregard for how the program syntax is laid out.
When macros became popular in assembly language programming, some said that
this amounted to high level language programming. But macros lack any
ability to check the correctness of or optimize generated code.
The inclusion of macros also betrays a link to FORTRAN. Because FORTRAN was
considered an out of date language with bad syntax (it was), a macro
preprocessor was one solution for making the language usable. But here we have
C, supposedly designed without the blemishes of FORTRAN, fitted with
this crutch from day one.
Most good C programmers don't use macros for anything but defines. The result
is an extra, slow text searching pass that can't be gotten rid of.
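
For a minimal sketch of the shredding involved, the textbook SQUARE macro
(not an example from the white book) will do:

#include <stdio.h>

#define SQUARE(x) x * x     /* pure text substitution, no knowledge of syntax */

int main(void)
{
    printf("%d\n", SQUARE(3));      /* expands to 3 * 3 = 9, as expected */
    printf("%d\n", SQUARE(1 + 2));  /* expands to 1 + 2 * 1 + 2 = 5, not 9 */
    return 0;
}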

INTERPRETIVE I/O

It truly amazes me how people can preach about the "efficiency" of C when it
uses interpretive I/O. The famous (infamous ?) printf statement must scan and
parse a format string in order to perform I/O, at runtime. This design error
is a direct copy of the "format" statement in FORTRAN !
After that, it is probably a minor issue that the I/O statements are misnamed
and misformatted. printf (print formatted) may or may not actually go to a
printer, but fprintf (file print formatted) is a total oxymoron. There is
no printing going on, even to a user console screen (unless redirection is
happening). If you are ready for that, you are ready for sprintf (string print
formatted), which has nothing to do with any I/O device whatever !
The creators of C can justifiably pat themselves on the back for the fact that
in C, at least, I/O statements are truly independent of the language. But this
stupid pet trick has been performed elsewhere before (Algol). The syntax of
printf means that even compilers that attempt to perform good parameter
checking must give up and let everything through. Moreover, the system fails
to specify I/O in a completely portable way. The exact method of getting
at printf's parameters is implementation defined. The standard workaround,
the va_arg macros, effectively forces C toward a stack implemented parameter
passing design, hardly the most efficient method. Some compilers revert
to a stack based format only when the variable argument marker (...) appears
in the declaration.
Moving down the I/O chain, we have the lovely tradition of getc and ungetc.
C addresses the legitimate need for lookahead with a pure hack. The programmer
simply puts back what he does not want, or, for that matter, puts back
anything else that comes to mind.
Finally, the method for finding out whether the end of a file has been reached
is to go ahead and read past the end of the file ! The ripples of this hack
spread far and wide. An integer (not a character) must be used to read from
a character file, because the end value (-1) is not representable as a
character. Error checking becomes implausible. It obviously is not an error
to read right through the end of the file, so what is ? A second attempt to
read EOF ? Most commonly, the result of a missed EOF is tons of garbage
spilled to the screen or a file.
Things get worse at the lower level. read (which, like it or not, is pretty
much de facto standard C) simply returns a count of zero at the end of the
file and -1 for everything else. So you got a -1 back from read. Is it a disk
error ? Is the computer on fire ? The program now descends to sorting out
error codes from errno, yet another error number facility, to find out.
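
Back at the stdio level, the upshot is the standard copy loop every C
programmer learns: an integer (not a character) to hold the result, and an
end found only by reading past it. A minimal sketch:

#include <stdio.h>

void copy(FILE *in, FILE *out)
{
    int c;                          /* must be an int: EOF (-1) does not fit
                                       safely in a char */

    while ((c = getc(in)) != EOF)   /* the end is found by reading past it */
        putc(c, out);
}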

WHAT IS AN END OF LINE ?

Near the order of magnitude of the reference checking problem is the infamous
"end of line" representation. Since C represents EVERYTHING as a special
character (including EOF as shown above), it of course follows that EOL
should also be a single character. Unfortunately, when C and Unix were created,
it rarely was. In fact, the prevailing standard, both now and at the time
C was created, was to use TWO characters for that job, carriage return and
line feed, corresponding to the control movements required to get a teletype
or printer back to the start of the next line. Rather than attempting to come
up with a good abstraction for this, C (and Unix) simply decided that a
new, single character standard would be set, the world be damned.
The general damage from this event is incalculable. To this day, Unix systems
cannot exchange files seamlessly with the rest of the world.
The damage to C is also apparent. Since the rest of the computing world did not
give up and change to fit C, C had to adapt to the standard ASCII EOL.
This it was not designed to do, and it remains one of the biggest problems
with C.
The most obvious way to adapt C to standard EOL conventions is to translate
CR/LF sequences to and from the C EOL (which is a LF alone). But at what level
does this occur ? Having that take place at the standard FILE handling
level means that this package is not available for reading and writing
binary (non-text) files (which it most certainly is used for, despite the
getchar and putchar names). Putting it deeper than that, in the read and write
routines, is worse. Not only is translation involved, but the exact size and
layout of the data is compromised, i.e., the data read no longer matches the
data on disk.
The solution placed into the ANSI standard is to have the program specify
what type of file, text or binary, is being opened. Since this was never
required under Unix, the wall between Unix and other systems is raised,
not lowered, and we now have a dependency entirely unrelated to the
type of processor being used, a true backwards step in language design.
The authors of C seem fully unrepentant for perpetrating this design flaw,
sniffing that "under unix, the format of a binary file and a text file
are identical" [from the white book].

TYPE SPECIFICATION

Even though type specification appears low on my gripe list, it is the one
sin that even the authors admit to. Now we have programs to translate
C type arcania into English. C declarations are supposed to look similar to
their use. That makes sense. Why should it be less arcane to define a type
than to use it ?
Before tearing into the ridiculous syntax of some C types, it is only fair
to mention that the inclusion of formal type identifiers in C came late
in C's life, in the form of typedef. Before that, the only name for a type
was buried in the odd syntax of structures:

struct x {

...
...

} a, *b;

Which introduces the type name x. The only apparent reason for this is to allow
self referential structures, a necessity. So structs have a name that may or
may not be used anywhere else. Further, the "*" modifies only the single name
it is attached to, not every name in the declaration.
Curiously, when actually using the type name of a structure, you have to tell
the compiler what it should already know:

struct x {

struct x *next;
...

};

struct x onevar;

Within the structure, x could conceivably be the name of another member,
although the usual practice for programmers who reuse names this way is to get
the current meaning. What need there is for this clarification in the outer
scope escapes all reason. And "*" gives a means to create two completely
different types in the same declaration list (see the example at the end of
this section).
With the addition of typedef, things take a marked turn for the worse:

typedef struct x {

struct x *next;
...

} a, *b;

Voila ! The meaning of a and b has changed dramatically. No longer are these
objects; now they are types. Not only that, but:

struct x

and

a

Are aliases of each other. The way I tell new programmers to deal with this
kind of fun is to use an example from the book, and forget trying to understand
it.
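
And to see the "*" point from above in its simplest form, here is a minimal
sketch (names invented) of one declaration list quietly producing two
completely different types:

struct node {
    struct node *next;
    int value;
};

struct node *p, n;      /* one declaration list, two completely different
                           types: p is a pointer to a struct node, while n
                           is an actual struct node */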

SYNTAX ARCANIA

One of the most maligned features of Pascal is the way the ";" is used. What
amazes me is that C is held up as an example of how semicolons should be used.
Consider:

while (1) doit();
while (1) { doit(); dothat(); }

Even though { ... } is a direct substitution for a single statement, it does
not need a trailing ";", even though C complains bitterly if any other
statement lacks it. Moreover, in the identical looking:

char mystrings[] = {

"one",
"two"

};

The semicolon is most certainly required. For function predeclarations
(admittedly a newer feature), things get worse:

void dothat(void);

and

void dothat(void)

Are two completely different animals, with only a single character difference.
The first predeclares the function; the second is the actual function
definition. Forget that semicolon on the predeclaration and you get a series
of nonsensical errors from the compiler, as it cheerfully heads off to parse
the rest of the program in the mistaken assumption that you are defining a
strangely large function. Similarly ugly things happen when you put that
semicolon on the header of the actual function definition.

For more syntax arcania, there is little to beat the "comma paradox":

myfunc(a+1, b+3);

Looks normal; the commas here are argument separators. But wrap the arguments
in one extra set of parentheses:

myfunc((a+1, b+3));

Oh yes ! Now the comma is an operator: evaluate and discard a+1, then pass the
value of b+3 as the single argument. The same character is a separator in one
context and an operator in another, and only the grouping of the parentheses
tells you which. Let's go out on a limb and say the compiler can figure this
out from the grouping and the declaration of the function. How about the human
reader of:

thisfunc(a, (b, c));

Which is defined:

void thisfunc(int, int);

Which commas separate arguments, and which ones discard values ?
In compiler/language textbooks, this is technically referred to as an "oops".

OPERATORS FROM HELL

I have never met anyone who did not like C operators. There are so many of them.
And they do everything. And if you don't know the precedence by heart, you are
so hosed:

a ? b = 1: b = 2;

Uh, oh. This gives a compile error. Since "=" has lower precedence than ?:, the
whole thing parses as (a ? b = 1 : b) = 2, and the compiler then complains
that the result of ?: is not something that can be assigned to.
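
The cure, as usual, is more parentheses; a one line sketch of what was
presumably meant:

a ? (b = 1) : (b = 2);      /* parenthesized, the assignments stay inside
                               the conditional, and it compiles */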

*a++;

Means "increment what a points at" (really). But what does:

*--a;

Mean ? How about predecrement pointer a, then get what it points to.
My favorite thing to do is to paste a Xerox of that precedence chart from
the white book on my desk right by the terminal, so I can see it always,
just to cover this sort of thing.
And how about:

a && b

Does this mean a logically anded with b ? Or a bitwise-anded with the address
of b ? Your precedence chart will not settle it; the answer comes from the
tokenizer, which always grabs the longest operator it can, so "&&" is read as
a single operator and you get the logical and. To REALLY get a anded with the
address of b, you must use:

a & & b

The real crux of the problem is why... why do you need all those operators ?

a = a+1;

and

a += 1;

and

a++;

All do the same thing. But are they really necessary ? All but toy compilers
know how to turn an add by one into an increment operation at the machine
level. Yet many C programmers are CONVINCED that they are writing more
efficient code when they directly specify that an increment (say) should
happen. When C was new, the shift of optimization work onto the programmer
meant smaller compilers could be built faster. But now, with compiler writing
a major enterprise, C compilers must often carry huge amounts of code to
rewrite or even outguess programmer statements. For example, the code:

a++;
a++;
a++;
a++;
b = *a;

Is odd, but the programmer might have written it because he knew that
four increments were more efficient than an add with a constant. This is
exactly the kind of calculation that changes when modern cached,
superscalar processors are used. Because the add with a constant is now
MORE efficient, the compiler can find itself having to UNOPTIMIZE
code that the programmer cleverly "optimized" in the source.

ASSIGNMENTS EVERYWHERE

C makes assignment an operator. But it goes even further than that; any
variable can be incremented or decremented at any time within an expression.
So you have:

b = a = b++;

Which, depending on when the compiler applies the post-increment, leaves b
either unchanged or incremented (and a equal to the old b); the behavior is
actually undefined.
Scattering assignments and side effects around an expression is a wonderful
way to create order dependence, which is why the language has to spell out
precedence and sequencing rules, and still leaves cases like this one
undefined.

WHAT VALUE BOOLEAN ?

As several languages have done, C reduces the check for boolean truth to a
check for a non-zero value, and provides no specific boolean type.
This has exactly the effect that might have been predicted: boolean
truth checking extends far beyond boolean operations:

char *p;

while (*p) putchar(*p++);

Because optimizing compilers know perfectly well how to deal with a check
against zero, there is no net difference between the above and:

while (*p != 0) putchar(*p++);

Save that the terse form is more arcane. In fact, quite the opposite of the
promised efficiency is achieved.
For a check of:

if (x) ....;

A compiler might like to implement the truth check as a simple bit test
against bit 0 of the value of x. The "official" values of true and false are
1 and 0 in C. But in fact x could be ANY non-zero value, because the
programmer has been encouraged to stick all kinds of odd values into such
conditionals (by many program examples, including those in the white book).
Only zero versus non-zero matters, so nothing short of a full compare against
zero will do, unless the compiler can determine that a genuine logical
operation produced x.
This problem ripples through the language. It seems odd that C needs both
a logical and a bitwise version of each operator (&& and & for logical and
bitwise "and"), since the results on the "official" 0 and 1 truth values
would be identical. The reason the logical versions are required is
that the values may not, in fact, conform to the 0 and 1 convention.
Extra work is then required to bring the values BACK into conformance
before a bitwise operation would give the right answer. The white book says:

while (*a++ = *b++);

Is a string copy. In fact the white book goes carefully through the steps
required to reduce the program to this terse form, and I have had this
poetic description mentioned to me many times as "proof" of the efficiency
of C. But entirely unmentioned is the fact that extra work will probably
have to be expended elsewhere to compensate for this language quirk.
The ugly fact is that those "while (p)" and "if (p)" expressions, which made
the original white book examples so sexy to programming newcomers, had a
hidden price: they encouraged the use of boolean statements and operators
on non-boolean values, which made boolean handling MORE complex and LESS
efficient in C. This is a language feature sold with all the honesty
of a used car lot.
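
For a minimal sketch of the normalization problem: hand the two "and"
operators a pair of perfectly legal non-zero truth values and they stop
agreeing with each other.

#include <stdio.h>

int main(void)
{
    int a = 2, b = 4;           /* both count as "true" by the non-zero rule */

    printf("%d\n", a && b);     /* 1: the logical form normalizes to 0 or 1 */
    printf("%d\n", a & b);      /* 0: 2 and 4 share no bits */
    return 0;
}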

THE CASE FOR CASE

Unusually among languages, in C case matters. This little quirk has
implications beyond the apparent:

MyVar

Myvar

myvar

Are all different names in C. Breaking with centuries of typographic tradition
in which capital letters serve simply as an alternate style of the same
letter, they are now wholly separate letters. We are never given a good
explanation for this. Was the world running out of good identifiers ? Did the
language authors just get a lower case terminal ? The common explanation, and
the one that the C authors vehemently deny, is that it was done for people
who could not type well, and needed the extra mode to supplement the "hunt
and peck" method. The legacy is that people in the Unix and C world now have
to take great pains to specify the exact case used, and you must look
carefully at your programs to see that the case of each identifier is correct.
Thanks, guys.
In typographical terms, the capital letters, small letters and, later,
italics started life as styles or "fonts" of type. They stopped being
individual type faces (or writing styles) when it was realized that their
underlying principles could be used in any type face or style for emphasis,
and rules then evolved for their use in language text.
But programming languages are not human language text! So why should the same
principles apply to them ? The answer is simple. If, after inventing the
automobile, you were to decide that people should walk in the street,
and drive on the sidewalks because, obviously, cars have little in common
with horses and carriages, don't be shocked when you have confused the
hell out of everyone (and seriously injured some). We use infix notation
in most programming languages, and read them from left to right, not because
there is something inherently better about those methods, but because
we respect human conventions and want computers to be accepted without
requiring huge paradigm shifts between programming and everyday life.

WHAT IS A STRING ?

Since C has no bounding, dynamic variables are basically where you find
space for them. C leaves the issue of how to find the length of such items
entirely unresolved, leaving it up to the programmer. The one exception is
strings of characters, which have a zero automatically appended to them.
This zero acts as the sentinel used to find the end of the string.
This scheme works quite well in the general case. But it has two built in
problems that become apparent with use. The first is that the length of
the string can only be found by traversing it. This can get prohibitively
expensive with very large strings, and most programs that handle such strings
give up and use a length/data combination.
Secondly, the removal of one character value from the usable set is a bigger
problem than it first appears. In many cases, we need to handle character
data "verbatim", including any embedded zeros, which the sentinel scheme
cannot do.
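
A minimal sketch of the length/data combination such programs fall back on
(the structure name here is invented for illustration):

#include <stddef.h>
#include <string.h>

struct counted_string {
    size_t len;                 /* the length rides along with the data */
    char  *data;                /* may legitimately contain embedded zeros */
};

/* the sentinel scheme, by contrast, pays a full traversal for every length
   and silently stops at the first embedded zero */
size_t sentinel_length(const char *s)
{
    return strlen(s);
}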

MODULARITY

I have to give C high marks for having separate compilation as a goal
since day one. Few languages had that feature at the time.
However, the modularity system under C is groaning under the weight of
advanced programs. The standard method is to put all the headers for
functions, and their data, into an include file. This means that two
separate representations of each function, the header declaration and
the actual function itself, must be created and maintained separately.
This creates a constant source of errors during development.
Further, problems have arisen when the same file is included by
multiple source files, with errors occurring as a result. A standard hack has
come about to deal with this (sketched below): define a macro, then check
whether it is already defined on subsequent inclusions, skipping the body of
the include file if it is a duplicate. All of these measures are clever, but the
module system in C is still entirely macro driven, and consists of hacking
together multiple sources to make a compile. The C language itself
includes no facilities for modularity.
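
For reference, a minimal sketch of the include guard hack mentioned above,
for a hypothetical header mydefs.h:

/* mydefs.h -- a hypothetical header */
#ifndef MYDEFS_H            /* body skipped entirely if the guard exists... */
#define MYDEFS_H            /* ...which the first inclusion defines */

struct point { int x, y; };
int distance(struct point a, struct point b);   /* the header copy that must
                                                   be kept in step with the
                                                   actual function */

#endif /* MYDEFS_H */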

CONCLUSION

I bought my first book on C long before I was actually able to lay my
hands on a compiler. C ran only on the DEC PDP-11 and a few other
computers, none of which I had access to. This gave me plenty of time
to study the language and dream of getting my hands on the real thing.
In the years since, C has gone from obscurity to vogue to necessity.
After programming with it for 15 years, and using other languages at
the same time, I can honestly say that C takes more time than other
languages of comparable functionality to debug and finalize a working
program. In addition, the final product is less readable, and less
maintainable. C remains a good language. Its simplicity is its biggest
virtue. But in no way, shape or form does it measure up to the hype
surrounding it, which often ignores the (many) flaws in the language.