Catch errors that are not caught by parsing (because many constructs are not context-free).
Examples (C):
void foo(int n) {
  int a = 0, i;
  printf("a = %d\n", a);
  for (i = 0; i < n; i++) {
    int a = 1;
    printf("a + j = %d\n", a + j); // a's value is 1 here; but where is j declared? This is a semantic error.
  }
  printf("a = %d\n", a);
}
void foo(int x) {
  int a = 4;
  bar(3);
}
void bar(int y) {
  printf("a = %d\n", a); // refers to a in the closest enclosing binding in the execution of the program
}
{ //new scope
  int x = 4; //declarations
  f(x);
}
Tree: NewScope --> (left child) declarations (x = 4); NewScope --> (right child) statements.
On entry to the new scope, the new declarations will be added to the symbol table before recursing down the right child (statements). On returning from the new scope, the new declarations will be removed and the old declarations of x (if any) will be restored in the symbol table.
enter_scope(): start a new nested scope.
find_symbol(x): search the stack of scopes, starting from the top, for the first scope that contains x. Return the first x found, or NULL if none is found.
add_symbol(x): add a symbol to the current scope (top scope).
check_scope(x): true if x is defined in the current scope (top scope). Allows checking for double definitions.
exit_scope(): exit the current scope.
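These operations can be implemented as a stack of per-scope maps. A minimal C++ sketch follows; the Info record and the exact signatures are illustrative assumptions, not prescribed by the notes.

#include <string>
#include <unordered_map>
#include <vector>

struct Info { /* type, offset, etc. */ };

class SymbolTable {
  std::vector<std::unordered_map<std::string, Info>> scopes; // stack of scopes; back() is the top
public:
  void enter_scope() { scopes.push_back({}); }
  void exit_scope() { scopes.pop_back(); }
  void add_symbol(const std::string& x, const Info& info) { scopes.back()[x] = info; }
  bool check_scope(const std::string& x) const { // is x defined in the top scope?
    return scopes.back().count(x) > 0;
  }
  const Info* find_symbol(const std::string& x) const { // search from the top scope downwards
    for (auto it = scopes.rbegin(); it != scopes.rend(); ++it) {
      auto f = it->find(x);
      if (f != it->end()) return &f->second;
    }
    return nullptr; // not found in any enclosing scope
  }
};

During the AST walk described above, enter_scope() and add_symbol() are called on entry to a NewScope node, and exit_scope() on return from it.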
add $r1, $r2, $r3
What are the types of $r1, $r2, $r3? We cannot say. For all we know, $r1 may hold a bit-pattern that represents a string. The only thing we can say is that these have the base type bitvector.
If e1 has type Int and e2 has type Int, then e1+e2 has type Int:

(e1:Int ^ e2:Int) => e1+e2:Int

This is a special case of the general form Hypothesis1 ^ Hypothesis2 ^ ... ^ HypothesisN => Conclusion.
|- Hypothesis    ...    |- Hypothesis
-------------------------------------
|- Conclusion

A sentence |- e:T reads "it is provable that e has type T"; |- means "it is provable that ...".

i is an integer literal (constant)
----------------------------------   [Int]
|- i : int

|- e1 : int    |- e2 : int
----------------------------------   [Add]
|- e1 + e2 : int

For example, 1+2 type-checks as follows:

1 is an int literal          2 is an int literal
-------------------          -------------------
|- 1 : int                   |- 2 : int

|- 1 : int    |- 2 : int
------------------------
|- 1 + 2 : int
In a rule concluding e:T, the hypotheses are statements about the types of e's subexpressions; the type of e is thus computed bottom-up over the structure of e.
S <: T is read "S is a subtype of T" or "T is a supertype of S". S is a subtype of T if any term of type S can be safely used in a context where a term of type T was expected.

Consider the types int and bool. Let's hypothetically say that every operator that is defined for a term of type bool is also defined for a term of type int, e.g., 1 && 3 evaluates to, say, 4 (for some strange definition of the semantics of && on integers). But that does not necessarily mean that int is a subtype of bool.
class A { ... };
class B : A { ... };
A a = B();
S <: S (reflexivity).
S <: T and T <: U implies S <: U (transitivity).
S <: T and T <: S implies S = T (antisymmetry).
In C, char is not a subtype of int. Similarly, int is not a subtype of float.

Consider void foo(int i) { ... }. Strictly speaking, it is not legal to call foo('c') (where 'c' is of char type); similar is the case for float and int. The compiler makes the call legal by implicitly casting the char value to int. This casting could also have been done explicitly by the programmer, to make the program legal. Recall that type casting is a backdoor provided in C to allow type conversions for programmer convenience.
Why did the designers of C not make char a subtype of int? Perhaps because they wanted to allow the programmer to have (among other things) two versions of a function foo, one that accepts a value of type char as an argument, and another that accepts a value of type int as an argument, also called function overloading.

Could we instead have made char a subtype of int, and chosen the most precisely fitting function definition for foo in case there are multiple choices? That is an interesting proposal, but then we also need to tackle the case where foo takes multiple arguments, and one definition is foo(int, char) and another is foo(char, int). It is quite unclear what to do in this case.
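To see the problem concretely, the following real C++ snippet is rejected as ambiguous: for the call below, each candidate matches one argument exactly and the other only via a promotion, so neither candidate is strictly better.

void foo(int a, char b) {}
void foo(char a, int b) {}
int main() {
  foo('x', 'y'); // compile error: call to 'foo' is ambiguous
}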
Soundness: if e:T is provable, then e evaluates to a value v of some subtype S <: T. For example, both of the following rules are sound (as we will see, void is the supertype of all types in our language), but the first is imprecise: with it, (1+2)+3 would not type-check.

|- 1 : int    |- 2 : int
------------------------
|- 1+2 : void

|- 1 : int    |- 2 : int
------------------------
|- 1+2 : int
Completeness: if e evaluates to a value of type T, then e:T is provable. However, achieving completeness is not possible in the presence of loops. For example, completeness requires that all programs that can produce a value v:T should also statically evaluate to type T; however, in the presence of loops, a value may be computed only if the loop terminates. If the loop never terminates, the type-checker should assign a type that agrees with non-termination (e.g., void, to indicate that this value can never be used, or a stronger type that indicates non-termination explicitly). This would amount to statically identifying whether the loop terminates, but that is the halting problem. There is more discussion on completeness later.

Our language has the types bool and int (among others). Important: void is the supertype of all types.

------------------   [false]
|- false : bool
------------------   [true]
|- true : bool

s is a character literal
-------------------------   [char]
|- s : char

s is a string literal
-----------------------   [string]
|- s : char *

|- b : bool
-----------------------   [Not]
|- !b : bool

|- x : int    |- y : int
-----------------------   [Relational]
|- x < y : bool

Notice that !i where i:int is not well-typed in our language. Similarly, true < false is not well-typed in our language.
Two expressions (or statements) may be sequenced using the semi-colon operator, such that the type of the compound expression is the same as the type of the second expression. If a new scope is created with an expression in its body, then the type of that new scope is the same as the type of its body.
|- e1 : T1    |- e2 : T2
--------------------   [Sequence]
|- e1 ; e2 : T2

|- e : T
----------   [Scope]
|- {e} : T

For example, the expression { { 2 + 3; 4; }; 5 } evaluates to 5.
Now let's consider an assignment expression (or statement). Recall that an assignment expression evaluates to the value that has been assigned, and its type is the type of the assigned variable.
|- x : T0    |- y : T1    T1 <: T0
--------------------------   [Assignment]
|- x := y : T1

This rule seems to be giving a valid type to the expression 1 := 2, but that would be incorrect in our language. To support assignment properly, we need to distinguish between a reference type and a value type.
For a primitive type T, its corresponding reference type is written T&. A reference type represents an object (a region of memory) holding the corresponding value type. By definition, T& is a subtype of T, i.e., T& <: T.

Example: the following subtyping relations can be present in our language: U <: T, T& <: T, U& <: U, U& <: T& (forming a diamond structure).
Thus, the assignment rule now becomes:
|- e0 : T0&    |- e1 : T1    T1 <: T0
-------------------------------   [Assignment]
|- e0 := e1 : val(T1)

Here val(T1) returns the value type of T1 if T1 is a reference type; else it returns T1 itself.
Notice that the conclusion assigns a value type to the assignment expression, which rejects (x := y) := z expressions, but accepts x := (y := z). If we want to admit (x := y) := z, we may want to change the inferred type in the conclusion to T1& (the type of x), but that would be quite non-intuitive, as that would likely mean that we are assigning z to y in the execution semantics.
The assignment rule admits the following code to be well-typed:

class A { ... };
class B : A { int y; ... };
A a = B();

If we replace this rule with a less general version like the following, then the code above does not remain well-typed.

|- x : T&    |- y : T
-------------------   [Assignment]
|- x := y : T

This rule is sound but not precise, in the sense that it does not admit all the programs that we would like to admit. This is similar to how a previous imprecise rule did not admit (1+2)+3.
Notice that the conclusion of the general rule assigns the more specific type T1, which admits (A a = B()).y as type-correct.
|- e : TB    |- x : T    TB <: bool
----------------------------   [While]
|- while (e) x : void

This is a design decision. One could have said that the type of a while loop is the type of the last statement that executes. But then what if that statement didn't execute even once? Hence the type void, which means that the value of this statement is not usable.
|- s : T&
-----------------------   [address-of]
|- &s : T *

Notice that the address-of operator is only well-typed when used on a reference type, e.g., it is illegal to say &1.
|- e1 : TB    |- e2 : T2    |- e3 : T3    TB <: bool
--------------------------------   [if-then-else]
|- if e1 then e2 else e3 : void

Notice that it is by design that if-then-else always has type void, even when T2 = T3.
|- e1 : TB    |- e2 : T2    |- e3 : T3    TB <: bool
--------------------------------   [if-then-else]
|- e1 ? e2 : e3 : val(lub(T2,T3))

An upper bound U(T2,T3) of types T2 and T3 is defined so that the following hold: T2 <: U and T3 <: U. A least upper bound lub(T2,T3) is defined so that lub is an upper bound of T2 and T3, and further, for any other upper bound U(T2,T3), lub <: U holds.
The lub need not be unique for a partially ordered set (poset), even in the presence of a void type. However, for our language, it turns out that lub is unique. If lub is not unique, this rule should be modified appropriately to resolve the ambiguity. One option is to have a hypothesis requiring that there exist a unique lub of T2 and T3 for this expression to be well-typed.
The val operator ensures that this expression cannot be used on the LHS of an assignment.
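As a concrete illustration of lub, here is a minimal C++ sketch for a single-inheritance hierarchy in which every chain of supertypes ends at void (the top type); the parent map and the string-encoded type names are illustrative assumptions.

#include <map>
#include <set>
#include <string>

// parent["B"] = "A", parent["A"] = "void", parent["int"] = "void", ...
static std::map<std::string, std::string> parent;

std::string lub(const std::string& a, const std::string& b) {
  std::set<std::string> ancestors; // a and all of its supertypes
  for (std::string t = a; ; t = parent[t]) {
    ancestors.insert(t);
    if (t == "void") break; // void is the supertype of all types
  }
  for (std::string t = b; ; t = parent[t]) {
    if (ancestors.count(t)) return t; // first supertype of b that is also a supertype of a
    if (t == "void") break;
  }
  return "void"; // unreachable if both chains end at void
}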
But what is the type of a variable reference?
x is a variable
----------------   [Var]
|- x : ?

The local, structural rule does not carry enough information to give x a type. We need more information in the rules.
A type environment gives types for free variables. For example:
x is free in the expression x.
x is free in the expression x+y.
x is free in the expression {int y = ...; x+y;}.
y is a bound variable in the expression {int y = ...; x+y;}.
The sentence O |- e:T is read: under the type environment (assumptions) O, it is provable that e has the type T.

i is an integer literal (constant)
----------------------------------   [Int]
O |- i : int

O |- e1 : int    O |- e2 : int
----------------------------------   [Add]
O |- e1 + e2 : int

Notice that the same assumptions are required in both the hypotheses and the conclusion.
And we can write our first rule that uses the type environment O:

O(x) = T
-------------   [Var]
O |- x : T&

Notice that we choose to keep the value type in O, and use it to assign the reference type to x --- this is a succinct way of saying that all variables must be of a reference type.
We allow a declaration only at the beginning of a scope; moreover, there can be at most one declaration at the beginning of a scope, and multiple declarations require multiple nested scopes.
O[T0/x] |- e1 : T1    T0 != void
----------------------------   [Let-No-Init]
O |- { T0 x; e1 } : T1

O[T0/x] represents the function defined by:
O[T0/x](x) = T0
O[T0/x](y) = O(y)   (for y != x)

Notice that we use T0/x instead of T0&/x --- that's because the Var rule only expects the value type of the variable in the type environment, and uses it to assign a reference type to the variable.
When we enter the scope, we add a new assumption in our type environment. When we leave the scope, we remove the assumption.
This terminology reminds us of the symbol table. So the type-environment is stored in the symbol table.
O[T0/x] |- e1 : T1    O |- e2 : T0    T0 != void
----------------------------   [Let-Init]
O |- { T0 x = e2; e1 } : T1

Notice that this rule says that the initializer e2 must have the same type T0 as x. This is imprecise; we should ideally allow the initialization as long as e2's type is a subtype of T0.
A better version of [Let-Init] using subtyping:
O[T/x] |- e1 : T1    O |- e2 : T0    T0 <: T    T != void
----------------------------   [Let-Init]
O |- { T x = e2; e1 } : T1

Notice that T0 only needs to be a subtype of T and not a subtype of T&. This allows int x = 3.
Notice that we disallow the construction of a term of type void&. Consequently, we also disallow the construction of a term of type void*.
O |- e1 : T1   ...   O |- en : Tn
------------------------------   [Dispatch]
O |- f(e1,e2,...,en) : ?

We also maintain a mapping for method signatures in O:

O(f) = (T1,...,Tn,T(n+1))

means that there is a method f such that

f(x1:T1, x2:T2, x3:T3, ..., xn:Tn) : T(n+1)

Hence, our dispatch rule becomes:

O |- e1 : T1   ...   O |- en : Tn    O |- f : (T1',...,Tn',T(n+1)')    T1 <: T1', ..., Tn <: Tn'
----------------------------------------   [Dispatch]
O |- f(e1,e2,...,en) : T(n+1)'

Notice that this tackles both call-by-value and call-by-reference: for example, T1' could be a reference type. Further, a reference type can be used as an argument where a value type was expected, but not vice versa.
Object-oriented languages also implement encapsulation (e.g., some functions are only visible in certain classes). To handle this, one can extend O to be a function from a tuple of class name and identifier name, e.g., C.f instead of f.
Example: The following shows an example syntax of a method declaration and dispatch.
class C {
  ....
  ...foo(T1 a1, ..., Tn an) ....   //method declaration
};
C c;
c.foo(....);
The following shows an example syntax of a method declaration and dispatch with inheritance (subtyping), where D <: C.
class C {
  ....
  ...foo(T1 a1, ..., Tn an) ....   //method declaration
};
D c;
c.foo(x1,...,xn);

In this case, we will use scoping rules to resolve "which foo?". For example, in this case, the foo is C's foo. Notice that the scoping rule will need to know the type of c and potentially the types of x1..xn to be able to identify "which foo?", and so scoping and typing work hand-in-hand. Irrespective, once the question has been answered, we annotate each such dispatch with the class name corresponding to the method being dispatched, e.g., the dispatch in the example above becomes:

D c;
c@C.foo(....);

This disambiguation is required to determine the type of the return value, e.g., C.foo may have a different return type than D.foo.
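For comparison, real C++ provides an analogous explicit qualification syntax for statically dispatching to a particular class's method (playing the role of the notes' c@C.foo(...)); this snippet is purely illustrative:

#include <cstdio>
class C { public: virtual ~C() {} virtual int foo() { return 1; } };
class D : public C { public: int foo() override { return 2; } };
int main() {
  D c;
  std::printf("%d\n", c.C::foo()); // explicitly dispatch to C's foo: prints 1
  std::printf("%d\n", c.foo());    // ordinary dispatch: prints 2
}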
To express inheritance, we will use the subtype relation for the this object in the method dispatch.
O |- e0 : T0&    O |- e1 : T1   ...   O |- en : Tn    O |- T.f : (T1',...,Tn',T(n+1))    T0 <: T    T1 <: T1', ..., Tn <: Tn'
----------------------------------------   [Static-Dispatch]
O |- e0@T.f(e1,e2,...,en) : T(n+1)

Notice that we require e0 to be a reference type. In other words, e0 should represent a region of memory, also called an object.
The following overloaded definitions can co-exist:
int foo(int)
long foo(long)
string foo(string)
However, string foo(int) cannot co-exist with the first foo (the two differ only in their return type).
Type resolution in this case is performed by looking at the types of the arguments: those definitions are enabled where the types of the actual arguments are respective subtypes of the formal arguments. If no definition is enabled, this is a type error. If more than one definition is enabled, that is usually a type-ambiguity error. (Notice that resolution rules such as "most closely matching type" are unlikely to be intuitive for the programmer when there are multiple arguments and one argument matches more closely for one candidate while another argument matches more closely for another candidate.)
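A minimal C++ sketch of this resolution scheme follows; the string-encoded types, the parent map (which pretends char <: int here), and the Signature structure are illustrative assumptions.

#include <map>
#include <string>
#include <vector>

// pretend subtype relation: char <: int <: void (void is the top type)
static std::map<std::string, std::string> parent = {
    {"char", "int"}, {"int", "void"}, {"void", "void"}};

bool subtype(std::string s, const std::string& t) { // is s <: t?
  while (true) {
    if (s == t) return true;
    if (s == "void") return false;
    s = parent[s];
  }
}

struct Signature { std::vector<std::string> formals; };

// Return the indices of the enabled candidates:
// empty => type error; more than one => ambiguity error.
std::vector<int> resolve(const std::vector<Signature>& candidates,
                         const std::vector<std::string>& actuals) {
  std::vector<int> enabled;
  for (int i = 0; i < (int)candidates.size(); i++) {
    const auto& formals = candidates[i].formals;
    if (formals.size() != actuals.size()) continue;
    bool ok = true;
    for (size_t j = 0; j < actuals.size(); j++)
      if (!subtype(actuals[j], formals[j])) { ok = false; break; }
    if (ok) enabled.push_back(i);
  }
  return enabled;
}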
For object-oriented languages, to resolve the this keyword, we also need to know the "current class" C in which the expression is sitting. To support this, we change the form of a typing rule (also called a sentence in logic) to:

O,C |- e:T

We use C = null for a context which is free of any "current class".
For example,

O,C |- e1 : int    O,C |- e2 : int
----------------------------------   [Add]
O,C |- e1 + e2 : int

The rule for the this keyword would be the following:

C != null
---------------   [This]
O,C |- this : C*
O,C |- e1 : int    O,C |- e2 : int
----------------------------------   [Add]
O,C |- e1 + e2 : int

This says that to type-check e1+e2, we have to type-check e1 and e2 separately in the same O,C environment.
TypeCheck(Environment, e1+e2) = {
  T1 = TypeCheck(Environment, e1);
  Check T1 == int;
  T2 = TypeCheck(Environment, e2);
  Check T2 == int;
  return int;
}

Let's consider a more complex example:

O |- e2 : T0    O[T&/x] |- e1 : T1    T0 <: T    T != void
----------------------------   [Let-Init]
O |- { T x = e2; e1 } : T1

Let's look at the corresponding type-checking implementation:

TypeCheck(Environment, { T x = e0; e1 }) = {
  T0 = TypeCheck(Environment, e0);
  T1 = TypeCheck(Environment.add(x:T&), e1);
  Check subtype(T0, T);
  Check T != void;
  return T1;
}
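For concreteness, the [Add] case of TypeCheck above could be rendered in C++ over a tiny AST as follows; the Expr/IntLit/Add node types and the error handling are illustrative assumptions (the environment is omitted since this fragment has no variables).

#include <memory>
#include <stdexcept>

enum class Type { Int, Bool, Void };

struct Expr { virtual ~Expr() {} };
struct IntLit : Expr { int v = 0; };
struct Add : Expr { std::unique_ptr<Expr> e1, e2; };

Type TypeCheck(const Expr& e) {
  if (dynamic_cast<const IntLit*>(&e)) return Type::Int;    // [Int]
  if (auto* a = dynamic_cast<const Add*>(&e)) {              // [Add]
    if (TypeCheck(*a->e1) != Type::Int) throw std::runtime_error("left operand of + is not an int");
    if (TypeCheck(*a->e2) != Type::Int) throw std::runtime_error("right operand of + is not an int");
    return Type::Int;
  }
  throw std::runtime_error("unknown expression kind");
}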
Recall that we claimed that type inference and type checking are similar, in that they use the same typing rules.
The typing rules can also be used to implement type inference. We can modify the TypeCheck
function,
so that along with the Environment
, it also passes down the acceptable types for a sub-expression, or
more generally type constraints. At the end, these constraints can be solved to identify the types of
the leaf nodes of the AST to implement type inference. If the constraints have no solution, it is a type error.
If the constraints have more than one solution, either there is a type ambiguity error, or there is a rule
to resolve the ambiguity. For example, the most general type may be used when
all the candidate types are ordered by the subtyping relation, i.e., the type that is the supertype
of all other candidates, e.g., \alpha-list is more general than int-list.
The dynamic type of an object is determined by the "C c" (object) or "C()" (value) expression that created it.
If a typesystem is sound, we do not need to implement dynamic checks for the properties that have been checked statically. For example, we do not need to dynamically check during the execution of 1+2 that 1 and 2 are both of type int.
Consider the following C++ program:

class A { ... };
class B : public A { ... };
void foo() {
  A* foo = new A;
  B* bar = new B;
  ....
  foo = bar;
}

The static type of foo is A*, but the dynamic type of foo is A* and B* at different program points.
Formalizing the relationship, for a sound typesystem: for all expressions E, dynamic_type(E) <: static_type(E)
In all executions, E evaluates to values of a subtype of the type inferred by the compiler. You have to actually run the program to obtain the dynamic_type. In the early days of programming languages, equality (instead of the subtype relation) was proved for type systems.
Notice that rejecting all programs is sound. That's why we also care about precision, i.e., how many programs we can willingly accept.
C c;
D d;   //where D is a subtype of C
....
D d2 = cond ? c : d;

Let's say that at runtime cond can never be true (so the conditional always evaluates to d), and so this program is dynamically type-correct. However, statically figuring out that cond is never true is undecidable.
Show a Venn diagram: statically-type-correct ⊆ dynamically-type-correct ⊆ parseable.
For soundness, statically-type-correct ⊆ dynamically-type-correct should hold.
For completeness, dynamically-type-correct ⊆ statically-type-correct should hold.
All operations that can be used on an object of type C can also be used on an object of type C' <: C.
The dynamic type is obtained through the execution semantics of the program, e.g., if we have an expression x++, and the current state of the program determines x to be of type int&, there is a dynamic typing rule that says that the type of the new expression is also int&. Consider the example { T x = e0; x; } where e0:T0 <: T; here the dynamic type of the expression is T0 (not T!). Similarly, the dynamic type of if x then y:T1 else Y:T2 will be either T1 or T2 (not lub(T1,T2) or void).
In other words, soundness states that the static type system will not accept any incorrect program (that will not pass the dynamic type check of equal power). A dynamic type check for array-bounds is not of equal power (it has more checks enabled); a sound static type check for array-bounds would reject all incorrect programs (but it will also reject a few more that will actually pass the dynamic type check).
Soundness of static type system: all dynamically-type-incorrect programs will be rejected. Ensured by ensuring that static_type is a supertype of dynamic_type. A trivial static-type system that rejects all programs is sound (but not useful).
Completeness: all dynamically-type-correct programs will be accepted. Not possible to ensure both soundness and completeness in general.
class Count {
  int i = 0;   //default value = 0
  Count inc() { i := i + 1; return *this; }
};

Now consider a subclass Stock of Count:

class Stock : public Count {
  string name;   //name of the item
};

And the following use of Stock:

Stock a;
Stock b = a.inc();
... b.name ...

This code does not type-check because the return value of inc is Count, which is not a subtype of Stock.
This seems like a major limitation, as now the derived classes (subtypes) will be unable to use the inc method.
However, at runtime, a.inc() has dynamic type Stock. So dynamically speaking, it is legitimate to write the following:

Stock b := a.inc();

But this is not well-typed, as a.inc() has static type Count.
Common workarounds to the incompleteness of a static type system:
dynamic_cast: returns nullptr at runtime if unsuccessful; else returns a pointer to the object of the new type. Allows bypassing the static type system. May entail runtime cost.
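A minimal sketch of this workaround for the Count/Stock example, written as compilable C++; it assumes inc() is changed to return a Count& and that Count is polymorphic, since dynamic_cast requires both.

#include <string>

class Count {
public:
  int i = 0;
  virtual ~Count() {}                       // dynamic_cast needs a polymorphic base
  Count& inc() { i = i + 1; return *this; }
};

class Stock : public Count {
public:
  std::string name;
};

int main() {
  Stock a;
  Stock* b = dynamic_cast<Stock*>(&a.inc()); // non-null here: the object really is a Stock
  if (b) { b->name = "apples"; }
}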
template<typename T>
class Count {
  int i = 0;
  T inc() { i := i + 1; return *static_cast<T *>(this); }   //static_cast gets checked at compile-time!
};
class Stock : public Count<Stock> { ... };

The use of static_cast is only a convenience, as the compiler will check at compile time whether this is guaranteed to be castable to T* (potentially using an analysis that is external to the type rules that we have discussed so far). One could replace static_cast with dynamic_cast with the same effect (except that the compiler will no longer check the cast at compile time).
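A compilable C++ rendering of this template-based sketch (the notes use the course's := assignment; here inc() returns a reference, and the names follow the example above):

#include <string>

template <typename T>
class Count {
public:
  int i = 0;
  T& inc() { i = i + 1; return *static_cast<T*>(this); }   // checked at compile time
};

class Stock : public Count<Stock> {
public:
  std::string name;
};

int main() {
  Stock a;
  Stock& b = a.inc();   // now well-typed: inc() returns Stock&
  b.name = "apples";
}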
We introduce a special type No_type for use with ill-typed expressions, with No_type <: C for all types C. This avoids cascading type errors due to a single type error.
Once a subexpression has type No_type, the enclosing expressions also result in No_type, so the one error is not reported repeatedly. With No_type, the type hierarchy is no longer a tree: it is a DAG with No_type at the bottom. A DAG may also occur due to multiple inheritance, for example.
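A minimal sketch of this error-recovery idea for the [Add] case; the Type enumeration and the reporting scheme are illustrative assumptions.

enum class Type { Int, Bool, Void, No_type };

// No_type is a subtype of everything; void is the supertype of everything.
bool subtype(Type s, Type t) { return s == t || s == Type::No_type || t == Type::Void; }

Type check_add(Type t1, Type t2) {
  if (t1 == Type::No_type || t2 == Type::No_type)
    return Type::No_type;             // error already reported deeper in the tree; just propagate
  if (t1 != Type::Int || t2 != Type::Int) {
    // report_error("operand of + is not an int");   // hypothetical error reporter
    return Type::No_type;
  }
  return Type::Int;
}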