Translating C to Ada

Derek Jones
derek@knosof.co.uk
Knowledge Software Ltd
Farnborough, Hants GU14 9RZ
UK

ABSTRACT

Automatic translation provides a cost effective means of reducing the time and effort of converting C source code into Ada. Experience gained from translating C into other strongly typed languages is described. Some of the problems in using Ada as the target language are given.

Introduction

At first glance the idea of translating C into Ada would appear unusual and without practical benefit. The reverse situation of translating Ada into C has the obvious advantage of making it possible to execute Ada programs on a wider range of processors and operating systems.

There are two reasons for wanting to convert software from C to Ada. The first applies to a software house, or development group using Ada that have acquired a body of C source code. A software house staffed by Ada programmers would need training to be able to understand and support programs written in C, or new staff would have to be hired. There are also logistical difficulties in having to support software written in two languages over a period of time. Several arguments can also be made to suggest that developers who use C ought to move to Ada. However, such arguments are not the topic of this paper. The second reason for wanting to convert is to provide an interface between C and Ada. In this case C header files are converted into Ada package specifications. The question of how the differing startup and runtime models coexist is also not the topic of this paper.

This paper is not purely about translating C to Ada. The base translator described here is capable of translating C into any algorithmic language. Over the last six years a lot of experience has been gained in translating C into other languages, both in terms of mapping language constructs and knowing what users want from such a translator. So to some extent this paper is about the evolution of a C to strongly typed language translator.

The first section gives some background on the development of the translator and how it got to its current state. Section two deals with the goals of translation, what the input language is and what we want the output code to look like. This is followed by an outline of the translator itself, both in terms of its operation and its internal organisation. The fourth section deals with the practical topic of using the translator; the goal of perfect translation not yet having been achieved some manual intervention can improve the quality of generated code. Finally we look at what goes into testing a translator of this sort and give some examples of translated code.

Background

The core translator has had something of a checkered history. It started life as a C to Pascal (UCSD dialect) translator. This translator had as its goal the production of readable, but not necessarily compilable output. It was written for a company staffed by Pascal programmers who had purchased a large C program and wanted to understand it. This translator was then turned into an ANSI C compiler (which validated in August 1990), but that is another story.

Although having much in common, there was some divergence between the translator and the C compiler. The next target language for the translator was Turbo Pascal, with the aim of producing compilable output. This version of the translator was heavily tied to Turbo Pascal specific features and was released as Shareware (it can be down loaded from the languages section of the Borland bulletin board). The heavy use, where possible, of Turbo Pascal specific features occurred for two reasons, 1) it gave a higher rate of perfect translation and 2) for commercial reasons we did not want the output of this translator being fed into other Pascal compilers, we wanted to sell copies of our C to ISO Pascal translator. This translator was quite successful and had something like a 80% conversion rate (that is 80% of converted programs would compile without being edited). In some cases programs even compiled and ran without being edited (a rare occurrence). Because of the close connection between the ANSI C compiler and C to Pascal translator it was possible for one to make use of code written for the other. In particular the C preprocessor is almost identical.

The next target language was IL. IL is the intermediate language used in the MALPAS formal verification system marketed by Rex, Thompson & Partners. IL has many Ada like features. But more important than what it has is what it does not have; pointers and global variables. We are frequently asked how a language that makes very heavy use of pointers is mapped onto a strongly typed language like Pascal. In some ways mapping to a language without pointers is easier, once the bullet has been bitten. But again that is another story.

Translating C to a language to be used for formal analysis introduced yet another set of requirements than those previously handled. Also the constructs available in the target language differed in many small ways from Pascal and the generic core of the translator had to be made even more generic.

A C to Modula-2 translator has been designed but no implementation work has yet been carried out.

As it currently stands the translator is approximately 57,000 lines of Pascal (the original ANSI C compiler was 40,000 lines). The source code has been ported to an ISO Pascal compiler on the PC, Turbo Pascal on the PC and Extended Pascal on the VAX.

A version of the translator that generates Ada is currently in the design phase. Some experimental work has been done by modifying the output routines to produce Ada rather than Pascal.

Goals of translation

If you ask a user want they want from a translator they will invariable reply, `readable code that performs the same function as the original'. Also they will sometimes suggest that the pricing ought to be similar to that of compilers. In practice the number of potential customers is small and they usually only want to translate a fixed number of programs a few times. This small market size and once only translation usage determines the style of generated code (ie the market is not large enough to warrant the production of an all singing and dancing translator). The translator also has to be able to compete in cost and quality against human translation. It has to be significantly more cost effective than human translation. This is because human translations are not usually simple translations, they are normally rewrites (hence humans add value to the translation). Also given the choice most programmers would rather rewrite than translate, its more interesting. So a strong management case has to be made that the translated code will be maintainable and cost effective in the long run. The optimal solution is usually a translation that does most of the editorial work but leaves some of the stylistic issues to human tuning.

Because a standard for C became official relatively recently the question of language dialects has to be considered. It would be possible for the translator to simply flag uses of dialect features as being non standards conforming and refuse to translate them. Such behaviour is not much use when a compiler exists that will handle that program. Fortunately the dialect features in common use are well known and relatively easy to support. Since work was originally started on the translator there has been some convergence by C compilers towards the ANSI standard. Starting from scratch today we would probably not include such wide ranging support for C dialects. However, since support for a variety of dialect features exists it is worth keeping.

C also has a number of implementation defined features that compilers often handle via command line switches, ie signedness of char. The translator also handles these features via command line options. It also offers command line options to select those implementation defined features not usually selectable by the user, ie order of evaluation of expressions.

Most compilers (translators) only have to worry about correctly processing the input source code. Translators that generate C as the output language rarely expect users to read or modify the generated code. Users translating from C are usually going to spend a lot of time reading and potentially modifying the generated code. So the style of the generated code is important. Ideally it should look like it was written by a developer fluent in the target language.

What is code style? To some extent the answer to this question depends on the language being considered. Different languages are stronger in some areas than others. These strengths and weaknesses influence the frequency of use of constructs and programmers approaches to coding in that language. For instance arrays in C are very much second class constructs. So Ada that has been naively translated to C looks unnatural (because of the heavy use of arrays). Similarly C encourages the use of pointers (and in some cases requires it). So translated C code can look unnatural in a language that does not force such use of pointers.

Another common requirement is that the generated code be restructured. C has a poor reputation for code structure and users feel that a translator ought to address this issue. In the first place the poor quality of C code tends to be overstated. Ada or Pascal programmers write code that is just as poor. Secondly the job of a translator is to translate. If the code needs restructuring it should be done by a code restructuring tool on the original source. One of the goals in the design of this translator was that it should not destroy code structure. So the output has as much structure as the input. To a large extent this has been achieved.

Code layout is another issue. Users want the translated code to be well presented. Again this is a job for another tool, a pretty printer. Our translator does make some effort to give structure to the layout of the generated code. But it does not have the fine tuning available in a good pretty printer.

An unspoken assumption is that the generated code will run through any Ada compiler on any processor. Since the same assumption is unlikely to be true of the original C it is rather much to assume that the translator will magically imbue the translated code with this property. In practice a `better' translation can be achieved by making use of compiler extensions to the target language. On the whole users seem to prefer the use of compiler specific features to the contortions that are sometimes required to keep the translated output generic to all compilers.

Overview of the translator

The translator assumes that the input source code is syntactically and semantically correct. Thus in principle little checking is guaranteed to be done on the C that is fed into the translator. In practice some checks were so simple to do that they were done, while others were done as an aid to debugging the translator in its early stages of development. The close connection between the core translator and our ANSI C compiler means that a lot of checking code has migrated over. In fact the C preprocessor does all of the checking required by the standard.

In operation the translator makes three passes over the source code. Between each pass an analysis phase collates the information gathered during the previous pass.

Performs semantic analysis of the entire program. This may include multiple C translation units (package in Ada terminology). The purpose of this pass is to find and remember all global declarations and objects with static storage duration. It also attempts to deduce whether function parameters are call by value and which by address.
Performs semantic analysis on a single translation unit. Remembers all declarations and the scope in which they occurred. Flow analyses pointer usage (for substitution with array accesses), and integer variable usage (for type conversion to BOOLEAN).
Operates on a single translation unit. Semantic analysis and generation of target language. Global declarations are generated prior to any function (procedure) bodies. Local declarations are generated at the head of every procedure (may involve lifting of nested C declarations).

Phases two through three are repeated for every unit that was processed by phase one. By performing an initial pass over the entire program it is possible to work out package dependencies. This information is used to generate declarations at the head of the package as well as generating a file containing a package definition.

One of the most difficult jobs of translation is giving a unique name to identifiers within their scope. C identifiers can exist in five different, independent name spaces and can be declared in nested local scopes. Collapsing the number of name spaces and lifting nested declarations to the outer local scope may introduce name clashes. The translator renames identifiers when this occurs. Also names are given to anonymous types.

There are also a variety of ad hoc features. These include turning #define's of constants into constant declarations, unfolding the C printf/scanf functions into separate I/O statements and removing unused variables.

Converting switch statements to if then else or case ...

The source of the translator itself can be grouped into three main sections:

C preprocessor,
semantic analysis, and
code generation

There is also a library of procedures for performing runtime support of the generated code.

The C preprocessor exists as a definable entity because it is translator independent, ie the source code can be moved between our compiler and translator without affecting much of the other code. Traditionally the preprocessor has existed as a separate entity within a C compilation system. In the case of the translator we have attempted to map some of the preprocessor constructs into Ada source. Hence the necessity for integrating the preprocessor.

The code (Ada) generation section has a very sharp and well defined interface to the semantic analysis phase of the translator. This means that improvements to the translator for one language automatically become available for the other supported languages.

The semantic analysis phase (includes syntax analysis) resembles that of a conventional compiler, with the exception that few checks are made concerning the legality of the input source. The main difference from a compiler is that symbol table information is not thrown away at the end of a scope. Also name lookup may return information obtained on a previous pass that would not normally be available at that stage during compilation.

All this storing of information on declarations, that cannot be disposed of, uses up a lot of memory. A conventional compiler can throw away a lot of information at the end of a scope. The translator has to hang on to this information for the next pass. The desire to reduce memory overheads increased the number of passes from two to three. By only processing a single translation unit on passes 2 and 3 the total memory overhead is reduced. Internally a memory management unit is used to swap portions of the heap out to disk (virtual memory in software). This memory manager had been written for a previous project and had proved its worth before. Experiments showed that there was approximately 100% overhead in using this software virtual paging unit. Given that the translator was not expected to be used frequently this was felt to be a reasonable overhead.

Using the translator

The get the best results out of any translation an iterative process is recommended. The first step is to translate the existing C source code, check the log files for warnings and look at the generated code. It is possible that use is made of C dialect features not supported by the translator. These will have to be corrected in the original C source.

Examining the generated code. It may be possible to improve the quality of the generated code by making small changes to the original C. The use of many different scalar types may result in a bulky translation, because the implicit C type conversions may have to be made explicit. Cutting down on the number of types used can produce more readable output, particularly since Ada does not perform any implicit casting in mixed mode expressions.

Also, providing prototypes for C fu~ctions (a recent addition to the language made by the ANSI committee) allows the translator to perform more checks on the original C and produce more accurate output. Many C compilers do not yet support this ANSI feature. A surprising number of bugs can be found in the original source when prototypes are added.

Prior to making the `final' translation the C source should be Q/Aed to ensure that any changes made have not introduced new bugs.

Testing the translator

Given that a 100% translation is not always achieved what is testing required to show? Firstly that the translator can process a large body of legal code without producing spurious warnings, crashing or looping, secondly that the translated output conforms to the expected level. This level of expectation is set by the time/cost tradeoffs mentioned above.

Three bodies of test code are used:

Actual user programs. These have been built up over the years from public domain sources and customers programs.
The BSI C validation suite. This is the validation suite used to check compilers for conformance to ISO/ANSI C throughout Europe.
A set of hand written test programs designed to check various functionality within the translator.

Testing the translator consists of running scripts to translate all of the above source code. Any translator messages are written to log files. These log files, along with the translated output is then compared against log files considered to contain the expected behaviour.

During the early development of a translator for a new target language the generated code changes quite a lot between runs. For this phase of development the generated code is not checked against expected output (since we don't yet know what that will be, in detail) but rather it is checked by running the code through a compiler for the target language. The number and type of warnings produced by the target language compiler being a good indication of the `quality' of translation.

Once the new code generator has settled down it is possible to start examining the generated code for suitable, tuning, improvements.

Language differences

Differences in the availability of features between the source and target languages are important for two reasons:

A language construct in Ada that is not in C may create user expectations, ie that the translator will make use of the construct.
A language construct in C that has no equivalent in Ada requires translation to a different construct or no translation at all.

Given that most users of a translator are not intimately familiar with both languages, indeed may not be familiar with either, these issues have to be dealt with in depth in the supporting documentation.

The most commonly occurring constructs that do not appear in C but do appear in other languages are the boolean datatype and the concept of pass by address. Although they do not appear in the C language, uses of these concepts do occur in C programs. For the first of these cases the translator uses various huristics to determine when a C object is being used in a boolean context. Slightly more contentious is its attempt to deduce what parameters, if any, to a function might be call by address (in fact Ada in out parameters are copy in and copy out on return; in parameters may not be assigned to, hence it is occasionally necessary to create local temporaries). This latter deduction is prone to error because it is not always possible to see all occurrences of calls to a given function.

A potentially more serious problem is the occurrence of C language constructs that have no equivalent in Ada. Uses of such constructs can result in no translation, at least for the construct concerned. Pointers to functions ...

The use of varadic functions, ie functions taking a variable number of functions is the most common of these constructs. Fortunately this usage tends to mainly occur in the area of I/O, which is translated as a special case into calls to the Ada I/O package. When it occurs in other situations it is simply flagged and any extra arguments commented out (any missing arguments are added, as dummy variables to maintain compilability).

C has three sorts of integral types; short, int, long and their unsigned variants (the char type being considered equivalent to int in expression contexts). Ada contains a single universal integral datatype. Although it lacks a range of predefined datatypes Ada provides two mechanisms for users to effectively create their own. The type machinery allows new types to be created (not synonyms as in C). The ability to overload the unary and binary operators provides a mechanism of emulating the semantics of operations on C datatypes without having to introduce unsightly function calls.

Character datatypes in C have some very integral type properties. Although named char this C type acts more like an 8 bit integral type. It can be mapped onto CHARACTER, but a fair amount of type transferring usually occurs. Also this mapping is C compiler dependent, some treating char as a signed quantity while others treat it as an unsigned quantity. Strings are terminated with a byte with all bits zero. Many C programs make extensive use of the null byte at the end of strings. The best approach to translation is to maintain this convention in the translated program (unless the user is willing to spend a considerable amount of effort changing the code, usually in some very subtle ways).

Cross translation unit dependencies are sometimes visible in C through the use of header files, or looking at the make file. But once again common C coding practice is against us. To reduce the number of dependencies and hence compilation time, many developers place all global declarations (objects and functions) in a single header file. Thus the translator has to work out what the actual dependencies are. This information is written out on the third pass as WITH clauses.

Some examples

C expressions can be loaded with side effects. Using these constructs is actually considered to be `efficient' coding:

i = j++ - ++k + ++l;

becomes:

l := l + 1;

k := k + 1;

i := j - k + l;

j := j + 1;

The order of evaluation used by the translator is user selectable between right to left and left to right. Had left to right order been selected in the above case the first two Ada statements would have been reversed.

if (i++)

   k = 0;

becomes:

temp_1 := i;

i := i + 1;

if  temp_1 /= 0 then

    k := 0;

    end if;

here a temporary variable is created by the translator. The use of an overloaded unary operator to handle this case was rejected on the grounds of style. Ada being viewed as a relatively side effect free language.

Names clashes can occur through expanding name scopes, merging name spaces or with Ada reserved words:

struct j{int f;} s; /* j in tag name space */

int RECORD = 0,

     j; /* j in identifier name space */

becomes:

type j is

         record

         F :INTEGER;

         end record;

declare

     s : j;

     RECORD_1   :INTEGER := 0;

     j_1 :INTEGER;

Below are two special cases designed to make the translated code more readable:

#define MAX_X   50           /* Maximum number of X's */

char x[MAX_X];

becomes:

MAX_X : constant INTEGER = 50;   /* Maximum number of X's */

x  :ARRAY[0..MAX_X-1] OF char;

A comment on the same line as an object like macro is stored with the body of the macro. Thus it is possible to generate the comment on the same line as the constant declaration.

In strictly C terms preprocessing happens prior to syntax and semantic analysis. This means that all macros occurring in the source have disappeared by the time that semantics comes onto the scene. The generated code would be much closer to the intent of the original if some macro names could be retained. In our case a macro is retained name if it is a constant expression (a macro body is a constant expression if it only contains constant literals and/or macros that are constant expressions). So in the above case the array has an upper bound of MAX_X-1 rather than 50-1.

C does not impose any ordering on declarations. This causes a slight problem in that functions may be called before being defined. This is handled in C by giving function declarations at the start of the source module, or in header files. An Ada programmer would arrange procedure definitions such that they occurred prior to being called. No such discipline exists in C.

int f(int);

int f(int a)

{ }

becomes:

function f(a :INTEGER) return INTEGER;

function f return INTEGER is

begin

end;

Again this is one area where a small amount of reorganisation of the original source code can have a large effect on the generated output. Not only can the ordering of functions in a source file be sloppy but also the placement of functions in the files making up a complete program can be lax. But then that is not a problem unique to C.