Scianta Intelligence Turning Knowledge into Intelligence

Building Intelligent Models

A Look at Some Fundamental Principles

©2001 Earl Cox

The great tragedy of science –
the slaying of a beautiful hypothesis by an ugly fact.

T. H. Huxley (1825-95)
Collected Essays (1893-4), “Biogenesis and Abiogenesis”

Who dares wins.
Motto of the British Special Air Service regiment, from 1942.
See J. L. Collins Elite Forces: the SAS (1986), Introduction.

A few years ago the hot topics in the Information Technology world were Data Warehouses and Data Marts. Both Chief Information Officers and their customers wanted a centralized facility that stored, purified, correlated, and served up information on demand. While the idea of a data warehouse remains the core ideal of most corporate IT shops, the concepts surrounding its organization and architecture and, especially, its delivery mechanisms have changed remarkably. In today’s rapidly changing and highly competitive marketplace, the idea of physical centralization has given way to a virtual data warehouse tied together with message-oriented middleware and distributed through application servers, web servers, and intelligent database systems.

The overriding influence on the corporate response to its information assets has been, of course, the dramatic rise of the Internet as a knowledge-bearing framework. From the global reach of the Internet, corporations have carved out their own pieces of this universe: intranets to bind together the information needs of the enterprise, extranets to solidify and control supply chains, and B2B and B2C service nets to give even the smallest corporation an equal footing with corporate giants as well as an essentially low-cost worldwide online presence. The Internet has given corporate decision makers and knowledge workers vast (and sometimes seemingly infinite) access to raw data and, in fact, to “raw” knowledge. This easy access to data, together with equally easy access to powerful analytical tools (Microsoft Excel and WizSoft’s WizWhy, as examples), often leads both end users and corporate model builders into a “build right now, worry about validity much later” approach to model building. The idea of prototyping (or protocycling) has given way to hammering together models without sufficient thought to the mechanics underlying the model building process. Using this approach in building intelligent or knowledge-based models is especially risky.

This article takes up a few of the important issues associated with designing and executing a knowledge-based model. We examine the nature of data variables, the relationship between data spaces and fuzzy spaces, the meaning of experimental controls, coping with noise, ambiguity, and missing data, the isolation of dependent and independent variables, and the use of statistical and regression analyses. We note that data mining alone is insufficient to build most real-world business process models. Thus, as an integral part of the modeling methodology, we consider the role of subject matter experts and the design and development of conventional rule-based expert systems.

Types of Models

Systems and knowledge engineers use the word “model” in a variety of contexts, but nearly all of them refer to some digital implementation of a well-defined process. The world of models and model building, however, encompasses a wide variety of representations. Although we are primarily concerned with system models, the evolution of such models often passes through or encompasses other modeling organizations. Figure 1 shows the basic model types and some of their possible interconnections.

Figure 1 Types of Models

Naturally, neither the model taxonomies nor the model boundaries are absolutes. As the dashed lines in the previous figure illustrate, one model may be the prelude to another (we often develop a narrative model before expanding our ideas into a mathematical or system heuristic model). Further, the classification of a model into one class or the other is not always possible, since the boundaries are very permeable. Table 1 summarizes how these models differ and how they are used.

Nature of Model - Organization and Use

Narrative

These models employ textual descriptions of some formal existing or hypothetical system. Although we seldom think of them as such, descriptions of a business process – procedure manuals, as an example – as well as government policy regulations are narrative models. Scientific theories are conventionally narrative models. Examples include the definition of physical systems such as the laws of gravitation (Newtonian and Einsteinian), quantum mechanics (Quantum Electrodynamics and Quantum Chromodynamics), and the theory of evolution. The latter tends to be a prime example of a primarily narrative scientific theory.

Physical

These models are constructed to test or evaluate some essentially physical system. They are what the German quantum theorists called anschaulich or, more literally, “look at-able”. Examples of physical models include precision aircraft models for use in wind tunnels, architectural renderings of buildings and bridges, and the molecular construction kits used in organic chemistry and genetics to represent such things as benzene rings and the helical DNA molecule.

Analog (Simile)

These models use the properties of one system to describe the behavior of another system that has a different physical form. Thus they are similes: one system functions like another system. Analog models work because there is a similarity or parallelism between the underlying forces that drive both systems. Until the recent accessibility of the personal computer, electrical and mechanical analog systems were routinely used to model complex networks such as process plants and highway traffic flows.

Mathematical

Mathematical models came into their own with the common availability of digital computers (through timesharing in the early days and now routinely on desktop and laptop personal computers). A mathematical model consists of equations, often with interdependencies. The ubiquitous spreadsheet is a prime example of a mathematical model. Control engineering systems and embedded controllers are also examples of mathematical models. Many of the models used in knowledge discovery and data mining are also mathematical models. These include neural networks and other forms of classification schemes, such as the decision tree (ID3 and C4.5), Classification and Regression Trees (CART), and Chi-Squared Automatic Interaction Detection (CHAID) algorithms.

Heuristic

The recent rise in machine intelligence and expert system technologies has introduced another type of model into the mix: the Heuristic Systems Model. These are often called if-then-else production systems and form the core knowledge repository of today’s decision support and expert systems. These models depend solely on the high speed, high computational capabilities of the computer. Heuristic models embody “rules of thumb” and other business processes. We often refer to them as policy-based models. Heuristic representations comprise the vast majority of modern business process models (BPMs).

Table 1 Model Categories (Taxonomies)

The last two taxonomic classes, mathematical and heuristic, form the focus of nearly all business process models. These are roughly classed as symbolic models and generally incorporate both algebraic and intellectual relationships. In the modern sense of the word model, we generally combine the two taxonomies into a single or hybrid class: the knowledge-based model.

Model and Event State Categorization

There is, however, a more fundamental classification of models based on the way model states are generated and how model variables are handled. Table 2 shows the partitioning into discrete or continuous and into deterministic or stochastic.

Table 2 Model Classification by Internal Structure

Technically, the classification of models as discrete or continuous refers to the model’s composite variable organization, but in actual practice the term is used to describe the model’s treatment of time. Many models have clear and precise demarcations of time which break up the system into regular intervals (such as many queuing applications, production or project schedules, traffic analysis models, portfolio safety and suitability models, etc.). Other models have a horizon that varies smoothly across time. These continuous models are rare in the business world, although they do occur in nonlinear random walk models that attempt to follow the chaotic trends of the stock market. In any event, of course, few continuous models are actually implemented continuously; instead they use a form of periodic or random data sampling.

The model state categorization – and a fundamentally important perspective on how the model is constructed – reflects the ways in which the underlying model relationships are or can be described. Outcomes in a deterministic model can be predicted completely if the independent variables (input values) and the initial state of the model are known. This means that a given input always produces a given output. The outcome for a stochastic system is not similarly defined. Stochastic models have an intrinsic degree of randomness coupling the variables with the ongoing state of the model.
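The distinction can be made concrete with a small sketch. The backlog model below is our own illustration, not one taken from a particular project: the function names, the order-backlog scenario, and the uniform noise around the mean arrival rate are all assumptions. Both versions are discrete-time models; only the second couples randomness to the model state.

```python
import random


def deterministic_backlog(initial_backlog, arrivals, capacity):
    """Deterministic model: the same inputs and initial state always give the same outcome."""
    backlog = initial_backlog
    history = []
    for arriving in arrivals:
        # Serve up to `capacity` orders per period; the remainder carries over.
        backlog = max(0, backlog + arriving - capacity)
        history.append(backlog)
    return history


def stochastic_backlog(initial_backlog, mean_arrivals, capacity, periods, rng):
    """Stochastic model: arrivals are drawn at random in every period."""
    backlog = initial_backlog
    history = []
    for _ in range(periods):
        # Assumed noise model: arrivals vary uniformly within +/- 3 of the mean.
        arriving = rng.randint(mean_arrivals - 3, mean_arrivals + 3)
        backlog = max(0, backlog + arriving - capacity)
        history.append(backlog)
    return history


# Repeating the deterministic run reproduces the same trajectory every time.
print(deterministic_backlog(5, [10, 12, 8, 15], capacity=10))
# The stochastic run generally differs from one execution to the next.
print(stochastic_backlog(5, 10, 10, periods=4, rng=random.Random()))
```

Passing a seeded generator (for example `random.Random(42)`) makes a stochastic run repeatable for validation without changing the stochastic nature of the model itself.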

Model Type and Outcome Categorization

From a knowledge and systems engineer’s perspective, the most critical model classification isolates the kind of outcome. There are two broad types: predictive and classification. Like taxonomic partitioning, these types are hardly “pure” and models often involve both. Table 3 shows model families sorted into clusters according to the underlying model taxonomy and the outcome type.

Table 3 Model Classification By Outcome

A predictive model generates a new outcome (the value of the dependent variable) based on the previous state of the independent variables. Regression analysis, as an example, is a predictive time-series model. Given a least squares linear interpretation of the data points x_1, x_2, x_3, …, x_n, the regression model predicts the value of point x_{n+1} (and a vector of subsequent points with varying degrees of accuracy). On the other hand, classification models analyze the properties of a data point and assign it to a class or category. Cluster analysis is a classification model. Neural networks are also predominantly classifiers: they activate an outcome neuron based on the activation of the input neurons. Such taxonomies, however, are hardly rigid. A rule-based predictive model can, under many circumstances, be viewed as a classifier. And many classification models, especially those based on decision tree algorithms, can generate rules that produce a model prediction.
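As a minimal sketch of the two outcome types, the code below fits an ordinary least squares line to a series indexed by time and extrapolates it, then shows a one-rule classifier for contrast. The function names, the time index 1..n, and the "high"/"low" labels are illustrative assumptions.

```python
def fit_line(values):
    """Least squares fit of values[i] against the time index t = 1..n."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two points to fit a line")
    times = range(1, n + 1)
    mean_t = sum(times) / n
    mean_x = sum(values) / n
    # Slope = covariance(t, x) / variance(t); the 1/n factors cancel.
    sxx = sum((t - mean_t) ** 2 for t in times)
    sxy = sum((t - mean_t) * (x - mean_x) for t, x in zip(times, values))
    slope = sxy / sxx
    intercept = mean_x - slope * mean_t
    return slope, intercept


def predict_next(values, steps=1):
    """Predictive outcome: extrapolate the fitted line to x_{n+1}, x_{n+2}, ..."""
    slope, intercept = fit_line(values)
    n = len(values)
    return [intercept + slope * (n + k) for k in range(1, steps + 1)]


def classify(value, threshold):
    """Classification outcome: assign a point to a category instead of a value."""
    return "high" if value >= threshold else "low"


forecast = predict_next([2.0, 4.0, 6.0, 8.0], steps=2)
print(forecast)
print(classify(forecast[0], threshold=9.0))
```

Feeding the forecast into the classifier is one small instance of the permeability noted above: a predictive model followed by a threshold rule behaves as a classifier.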

Knowledge-Based Models

Aside from purely statistical models used for simple data segmentation, hypothesis testing, and regression analysis, most advanced modeling approaches incorporate computational intelligence components such as neural networks, fuzzy systems, and expert systems technologies. These fall under the general rubric of knowledge-based models. The knowledge-based model fuses machine intelligence with purely algebraic formulations. Figure 2 shows the high level schematic of a typical knowledge-based modeling environment.

Figure 2 A Knowledge-Based Modeling Environment

The hybrid model exists inside the knowledge base, expressed as a collection of non-procedural rules. The rules are non-procedural in the sense that their order inside the knowledge base is unimportant. It is the responsibility of the inference engine (see discussion below) to find the rules representing the current model state and place them in the proper execution order. Table 4 describes the principal components of the knowledge-based modeling environment.

Component - How It’s Used

Knowledge Base

A centralized (or more often in today’s Web-centric environments, distributed) repository of business or technical intelligence. The knowledge base contains a wide variety of components: variables, fuzzy sets, procedures, and if-then-else rules. Often a knowledge base is decomposed into smaller units, called policies, each of which acts as a stand-alone model under the control of a high level organizing mechanism.

Inference Engine

The core reasoning mechanism in a knowledge-based model. This is the seat of the model’s machine intelligence. An inference engine actually executes the model by recognizing or finding a goal state, collecting rules from the knowledge base, ordering them by some prioritization scheme, and then executing each in order. Inference engines implement the high level reasoning protocol, such as backward or forward chaining and fuzzy or approximate reasoning.

Agenda Manager

A mechanism for selecting and ordering rules. An agenda manager dynamically creates a list of rules that will be executed in a certain sequence. Often called the Conflict Resolution Agenda, the manager resolves conflicts between which rules should be active during different states of the model.

Rule Induction

A method of discovering relationships in large databases and deriving the if-then rules which describe the behavior of these patterns. Rule induction is used to extract rules from the data and populate an incipient model. The process of inducing rules, also called knowledge discovery, draws on such wide ranging technologies as decision trees, neural networks, genetic algorithms, and evolutionary programming.

Table 4 Knowledge-Based Model Components
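The interplay between the knowledge base, the agenda manager, and the inference engine can be sketched in a few lines. This is a deliberately small forward-chaining illustration under our own assumptions, not the design of any particular product: the `Rule` fields, the numeric priority scheme, the fire-once policy, and the credit example are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Facts = Dict[str, object]


@dataclass
class Rule:
    name: str
    priority: int                       # higher priority fires first (assumed scheme)
    condition: Callable[[Facts], bool]  # the "if" part
    action: Callable[[Facts], None]     # the "then" part, which updates the facts


def build_agenda(rules: List[Rule], facts: Facts, fired: List[str]) -> List[Rule]:
    """Agenda manager: select unfired rules whose conditions hold, then order them."""
    ready = [r for r in rules if r.name not in fired and r.condition(facts)]
    return sorted(ready, key=lambda r: -r.priority)


def run(rules: List[Rule], facts: Facts) -> List[str]:
    """Inference engine (forward chaining): fire rules until none applies."""
    fired: List[str] = []
    while True:
        agenda = build_agenda(rules, facts, fired)
        if not agenda:
            return fired
        rule = agenda[0]
        rule.action(facts)      # may change the facts, so the agenda is rebuilt
        fired.append(rule.name)


# The rules are non-procedural: their order in this list does not matter.
knowledge_base = [
    Rule("approve_credit", 1,
         lambda f: f.get("risk") == "low",
         lambda f: f.update(decision="approve")),
    Rule("assess_risk", 5,
         lambda f: "income" in f and "risk" not in f,
         lambda f: f.update(risk="low" if f["income"] >= 50000 else "high")),
    Rule("refer_credit", 1,
         lambda f: f.get("risk") == "high",
         lambda f: f.update(decision="refer")),
]

facts = {"income": 62000}
print(run(knowledge_base, facts), facts["decision"])
```

Reversing the order of `knowledge_base` leaves the firing sequence unchanged, because the engine, not the author of the rules, decides the execution order.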

Because knowledge-based systems are constructed from non-procedural rules, they form the core technology in most predictive data mining and business process modeling projects. The goal of knowledge discovery is not simply the unveiling of patterns deep in corporate databases (and spreadsheets), but the validation and consolidation of the knowledge into business process models. These models form a powerful battery of predictive and classification tools. They acquire and formalize corporate intelligence in a way that can be brought to bear on difficult problems connected with the long term survival of the organization. This same consideration of an intelligence utility function is just as important for government policy makers and military strategy planners.

Knowledge, Intelligence, and Models

We have completed a brief tour through the taxonomy of models. But this classification scheme does not address a fundamental issue – what are the foundations of corporate or government agency models? This is an important question because it leads to the fundamental decisions and actions necessary to create, validate, and deploy realistic and robust models. Any methodological process that addresses this question must also address the underlying relationships between data, information, and knowledge. Figure 3 illustrates the organization of this relationship and shows some of the concerns at each level in the hierarchy.

Figure 3 The Foundations of Knowledge

Raw data lies at the base of the knowledge pyramid. Estimates put data acquisition, profiling, cleaning, and organization at somewhere between fifty and sixty percent of the total time in a data mining project. Understanding the nature of a model’s data substrata is crucial to interpreting and validating the model’s performance. For, unless we have confidence in our understanding of what the data is telling us, we cannot reliably measure the degree to which unusual patterns are artifacts of the data, errors in our analysis, or errors in the underlying model.

Clean and trustworthy data is the basis for information. It is during this stage of the modeling process that most of the embedded data relationships are revealed. Transforming data into information essentially involves computing and organizing aggregated data. Such collections often involve counts, classifications, ratios, and totals. A few examples include:

Percent sales of product X by region by quarter
Total backorders by product line
Investment strategies ranked by income class
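The first of these aggregates can be sketched as follows. We assume here that “percent sales of product X” means product X’s share of total sales within each region and quarter; the record layout (`region`, `quarter`, `product`, `amount`) and the sample figures are hypothetical.

```python
from collections import defaultdict


def percent_sales_by_region_quarter(records, product):
    """Share of each (region, quarter) sales total contributed by `product`, in percent."""
    totals = defaultdict(float)
    product_totals = defaultdict(float)
    for rec in records:
        key = (rec["region"], rec["quarter"])
        totals[key] += rec["amount"]
        if rec["product"] == product:
            product_totals[key] += rec["amount"]
    # Skip empty cells so we never divide by zero.
    return {key: 100.0 * product_totals[key] / total
            for key, total in totals.items() if total > 0}


# Raw data: one record per sale line (illustrative values only).
sales = [
    {"region": "East", "quarter": "Q1", "product": "X", "amount": 300.0},
    {"region": "East", "quarter": "Q1", "product": "Y", "amount": 700.0},
    {"region": "West", "quarter": "Q1", "product": "X", "amount": 250.0},
    {"region": "West", "quarter": "Q1", "product": "Y", "amount": 250.0},
]

# Information: ratios organized by region and quarter.
print(percent_sales_by_region_quarter(sales, "X"))
```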

Information provides the analyst with a deep understanding of the way data is related and possibly organized within a database. In many data mining and model development projects, the “raw” data resides at the information level. As an example, one customer service model aggregated data into ratios and totals reflecting the activity between a customer representative and the customer calls, and combined this with information about the experience of the representative. The data mining project evolved a model that predicted lost customers based on the type of exchange and the profile of the representative.

Knowledge is the result of using information to gain a clear and deep perspective on the nature of a business process. The act of converting information into knowledge involves a consolidation of information into a form that reveals basic properties of the process rather than the data. We often do this through computer modeling. Knowledge-based models take knowledge and use it to generate intelligence, which in this sense is the application of knowledge to simulate the behavior of a system involving people and processes. Figure 4 illustrates how collections of knowledge-based models, working together across a variety of business processes, contribute to the intellectual assets of an organization.

Figure 4 The Information Platform

This article has examined some of the fundamental issues in building knowledge-based models from a perspective linked tightly to the concepts of rule induction and analytical data mining. The methodology structure is intended as a broad schematic. It is not dogma. Each organization must approach the knowledge discovery process from its own concern for technology, tools, resources, goals, capitalization, and acceptance within the corporate culture.

Model Development and Protocycling

Data mining and the allied disciplines of knowledge discovery are ways of finding patterns and discovering rules describing these patterns. They are not, however, a self-sufficient technique for building models. Models require a synergy (cooperation, collaboration, and symbiosis) between human experts and other forms of knowledge. This means that the model evolution process involves a model construction and validation phase (the subject matter expert, or SME, component) and often a discovery phase (the data mining component). Often it is prudent to do a little data mining before working with the subject matter expert. Sometimes knowledge discovery can also aid subject matter experts in more fully exploring their own expertise and their own views of the underlying problem states. These phases are usually iterative. They are repeated until the model outcome consistently converges on a prediction or classification that falls within a small standard error. Figure 5 shows this fusion of knowledge discovery and subject matter expertise in the model development cycle.

Figure 5 The Protocycling Development Cycle

Initial rule sets discovered by the data mining process are evaluated and, if needed, tuned by the subject matter expert. Rule discovery may involve many re-generation cycles as the parameters of the rule induction engine are tuned or modified to extract rules based on more focused knowledge (often these parameter changes involve adding or deleting variables or fine tuning the fuzzy sets associated with variables). The induced rules become fused with those elicited from the subject matter expert to form the model’s working knowledge base. At this point, the model is executed against validation data. If the error is within acceptable tolerances, the model is deployed; otherwise we start another cycle of refinements.
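The cycle just described can be written down as a simple control loop. The sketch below is schematic: `induce`, `fuse_with_expert`, `validate`, and `tune` are hypothetical callables standing in for the rule induction engine, the expert review, the validation run, and the parameter tuning, and the toy stand-ins after the loop exist only to make it executable.

```python
def protocycle(induce, fuse_with_expert, validate, tune, params, tolerance, max_cycles=10):
    """Repeat induce -> fuse -> validate until the error is within tolerance."""
    for cycle in range(1, max_cycles + 1):
        induced = induce(params)                    # discovery phase (data mining)
        knowledge_base = fuse_with_expert(induced)  # SME review and fusion
        error = validate(knowledge_base)            # run against validation data
        if error <= tolerance:
            return knowledge_base, error, cycle     # ready to deploy
        params = tune(params, error)                # refine and start another cycle
    raise RuntimeError("model did not converge within max_cycles")


# Toy stand-ins: the "rule base" is a single threshold rule on one variable.
VALIDATION = [(1, "low"), (3, "low"), (6, "high"), (8, "high")]


def induce(params):
    return {"threshold": params["threshold"]}


def fuse_with_expert(induced):
    # A real expert would add, remove, or adjust rules; here we only tag the origin.
    return {**induced, "source": "induced+expert"}


def validate(kb):
    wrong = sum(("high" if value >= kb["threshold"] else "low") != label
                for value, label in VALIDATION)
    return wrong / len(VALIDATION)


def tune(params, error):
    return {"threshold": params["threshold"] + 2}


kb, error, cycles = protocycle(induce, fuse_with_expert, validate, tune,
                               params={"threshold": 0}, tolerance=0.1)
print(kb, error, cycles)
```

The `max_cycles` guard is our addition; it reflects the practical point that refinement cannot continue indefinitely.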

Subject Matter Experts and Acquired Knowledge

Subject Matter Experts (SMEs) provide the basic source of intelligence in an expert system. Other sources include articles, procedure manuals, reference documents, repair manuals, and so forth. Knowledge from articles and other documents should be used carefully, however, since it is subject to both practice and procedural errors (manuals, sometimes written before a process is actually implemented, often describe how something should be done, not how it is actually done). The basic foundation of knowledge engineering involves the extraction of the actual decision making actions underlying a process. Process identification (BPM, or Business Process Modeling) involves the functional decomposition of an expert’s activities into a set of tasks (sometimes these tasks are called policies, contexts, or functional units).

Knowledge acquisition begins with a narrative. This narrative is extracted from the subject matter expert through gentle, low-interference but consistent guidance from the knowledge engineer. It is important to capture the expert’s sense of the task’s nature, its degree of difficulty and complexity, and how he or she goes about making decisions. The preliminary interview should capture as much of the process as possible in the expert’s own voice. At a high level, the general narrative extraction technique follows these steps:

CREATE NARRATIVE:
   Record and transcribe the preliminary narrative
   Find inconsistencies
   Identify ambiguous or vague terms
   Attempt to assign sequence to the narrative
CYCLE:
   Clarify problems with subject matter expert
   Rewrite narrative, develop list of unresolved questions
   Wait several days
   Re-interview subject matter expert, narrowing focus
REPEAT CYCLE until narrative is satisfactory

Usually no single narrative cycle is completely sufficient to create the rules for a real-world task. On the other hand, a knowledge engineer cannot continue the process of narrative refinement without stopping at some point to begin knowledge extraction (if nothing else, you will ultimately alienate the expert and generate friction with the expert’s manager). Eventually, no matter at what point you freeze the expert’s storytelling, you will need to refine the narrative and (consequently) refine all the derived knowledge. A preliminary knowledge representation is called a prototype. The process of arriving at a working prototype is called protocycling.

The Methodology

Approaches to model development differ according to the needs of the corporation, the analytical culture of an enterprise, and the requirements for compatibility with standards (such as object orientation). We now look briefly at a high level methodology for model construction that is independent of the underlying language or architecture. Data mining or, more properly, knowledge discovery is the process of uncovering behavior patterns buried deep in large quantities of raw data. This methodology follows a rule induction technique to actually build a working process model of these behaviors. The actual model development process consists of several steps, as illustrated in Figure 6, below.

Figure 6. The Knowledge Discovery Methodology
(Problem Definition Phase Highlighted)

Problem Definition is a crucial first step in deciding what we are attempting to model, its basic components, and the nature and meaning of the data elements. Two critical outcomes are generated in this phase: a statement of the project (model) Objective(s) and a complete Data Dictionary. Although this may seem elementary on the surface, the actual specification of the project objectives is absolutely crucial to the success of the project. The objective statement defines what is expected from the model, how it will be judged and evaluated (when will we know, as a not so trivial example, that the project is complete?), and what decisions will be made based on the model output. The objective statement also indicates the kind of model we will build (optimization, forecasting, analysis, or comparison, as examples) and the kind of knowledge discovery technique required (supervised or unsupervised).

The Data Dictionary defines all the terms, concepts, and data elements in the problem, establishes standard abbreviations, and gives each entry a precise and formal definition. As Figure 7 illustrates, the dictionary is a repository of project standards and data sources, as well as sharable information about the project components and products.

Figure 7 The Basic Data Dictionary

The size, scope, completeness, and details of the data dictionary are project dependent. Small projects are less likely to create a full dictionary, while projects associated with regular mining of the operational data store in a Data Warehouse might construct a fully detailed dictionary. Even for a small project, however, the data dictionary is an essential project component.

Data dictionaries are often constructed in HTML (HyperText Markup Language) or, more recently, in XML (the Extensible Markup Language) and reside on the corporation’s intranet (or a secure web site on the Internet). This approach permits an efficient cross-indexing (through hypertext links) of all the dictionary components. Web-resident data dictionaries are also highly sharable and always up to date. Java-based projects can use the Javadoc utility to generate HTML documentation for models, algorithms, and design documents.

This centralized data dictionary is used by all people and systems interacting with the data mining project. Nothing within the boundaries of the project necessarily changes the formal definition of elements as used by the organization, and ambiguities of meaning and terminology may persist in the company at large. Within the confines of the project, however, the project manager insists that all users who deal with the project accept the definitions established by the Data Dictionary.
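Whatever the storage format, the dictionary itself is a simple structure. The following sketch shows one possible in-memory form; the field names (`abbreviation`, `source`, `valid_range`) and the sample entry are our own assumptions about what a project might record, not a prescribed layout.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class DictionaryEntry:
    name: str            # the term as used within the project
    abbreviation: str    # the standard abbreviation
    definition: str      # the precise, formal definition
    source: str          # where the data element comes from
    valid_range: Optional[Tuple[float, float]] = None


class DataDictionary:
    """One agreed definition per term; duplicates are rejected, not overwritten."""

    def __init__(self) -> None:
        self._by_name: Dict[str, DictionaryEntry] = {}
        self._by_abbreviation: Dict[str, DictionaryEntry] = {}

    def add(self, entry: DictionaryEntry) -> None:
        if entry.name in self._by_name:
            raise ValueError(f"duplicate term: {entry.name}")
        if entry.abbreviation in self._by_abbreviation:
            raise ValueError(f"duplicate abbreviation: {entry.abbreviation}")
        self._by_name[entry.name] = entry
        self._by_abbreviation[entry.abbreviation] = entry

    def lookup(self, term: str) -> DictionaryEntry:
        """Resolve either a full term or its standard abbreviation."""
        entry = self._by_name.get(term) or self._by_abbreviation.get(term)
        if entry is None:
            raise KeyError(f"'{term}' is not defined in the data dictionary")
        return entry


dictionary = DataDictionary()
dictionary.add(DictionaryEntry(
    name="customer tenure",
    abbreviation="CUST_TEN",
    definition="Whole months since the customer's first order.",
    source="orders database (hypothetical)",
    valid_range=(0, 600),
))
print(dictionary.lookup("CUST_TEN").definition)
```

Rejecting duplicate terms and abbreviations is one way to enforce the rule that everyone on the project works from a single definition.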

Data Cleaning and Analysis Phase

Figure 8. Data Cleaning and Analysis Phase

The next phase in the knowledge discovery process is data cleaning and analysis (see Figure 8). This is the most time consuming and, for many organizations, the most difficult part of the process. Using the data dictionary, access to all the data sources must be secured. The critical elements for analysis are isolated and a process of scrubbing and purifying the data begins. Data cleaning starts with an assessment of the data properties and moves to making the complete data space as consistent and error-free as possible. Understanding the semantics and relationships between data elements in a model is two-thirds of the battle in building a model, yet it is also the weakest link in the methodology. We cannot rely on decision tree algorithms, regression analysis, or statistical correlation techniques to filter out what is important, what is immaterial, and what is related to what. A model builder must have a deep understanding of the “semiotics” (the meaning) of each model element.
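The assessment step can begin with something as modest as the profile below, which reports missing and out-of-range values for a single data element. The function name, the use of `None` to mark a missing value, and the sample ages are illustrative assumptions; the valid range would normally come from the data dictionary.

```python
def profile_column(values, valid_range=None):
    """Summarize missing and out-of-range values for one data element."""
    missing = [i for i, v in enumerate(values) if v is None]
    present = [v for v in values if v is not None]
    out_of_range = []
    if valid_range is not None:
        low, high = valid_range
        out_of_range = [i for i, v in enumerate(values)
                        if v is not None and not (low <= v <= high)]
    return {
        "count": len(values),
        "missing": missing,            # row positions with no value
        "out_of_range": out_of_range,  # row positions violating the valid range
        "min": min(present) if present else None,
        "max": max(present) if present else None,
    }


# Hypothetical "age" element with one missing and one implausible value.
ages = [34, None, 29, 412, 51]
print(profile_column(ages, valid_range=(0, 120)))
```

A profile of this kind only flags suspect rows. Deciding whether 412 is a keying error, a different unit, or a legitimate code still depends on the model builder’s understanding of what the element means.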

In Conclusion

Thus, in a larger and more global view of model development, the process of coupling human expertise, machine reasoning, and knowledge discovery forms the cornerstone of all successful model building processes. Too often we confuse the technology with the end product, leading us to develop models that produce results that are too brittle and difficult to objectively justify. In previous articles I have discussed the use of fuzzy models to add robustness and flexibility to eBusiness and eCommerce models and the use of fuzzy measurements to incorporate a natural way of dealing with uncertainty and ambiguity (a necessity in today’s CRM-centric business environment). However, neither fuzzy logic, nor neural networks, nor spreadsheets, nor rule discovery systems can take the place of human insight and careful attention to the organization and functional processes in knowledge-based models. It has been my experience, in over twenty-seven years of model building, that the technology is rarely at fault in our models. Models fail because we as designers and implementers did not pay attention to a common sense rule all of us learned when we became professionals: Garbage In, Garbage Out (GIGO).


 


© Scianta Intelligence 2005 all rights reserved