Our aim is to enable a machine to observe and interpret the behaviour of others. To this end, mathematical models are employed to describe certain biological motions; the main challenge is to design models that are both tractable and meaningful. In the first part, we describe how computer vision techniques, in particular visual tracking, can be applied to recognize a small vocabulary of human actions in a constrained scenario. To formalize a general framework, the problems of viewpoint and scale invariance in particular need to be overcome. The second part of the article is therefore devoted to the question of whether a particular human action should be captured in a single complex model, or whether it is more promising to make extensive use of semantic knowledge and a collection of low-level models that encode certain motion primitives. Scene context plays a crucial role if we intend to give a higher-level interpretation rather than a low-level physical description of the observed motion, and a semantic knowledge base is used to establish this context. The resulting approach consists of three main components: visual analysis, the mapping from vision to language, and the search of the semantic database. A small number of robust visual detectors are used to generate a higher-level description of the scene. The approach, together with a number of results, is presented in the third part of this article.