### See also Constud version 3

### Components of the System

- Software application Constud.
- Database(s): knowledge base(s) of observations (feature vectors of so-called cases), parameters of machine learning, results of machine learning, and predicted values.
- Data layers of explanatory variables, pre-classifiers and interpolation polygons.

Data layers are needed only when spatial data are used. With non-spatial data, the system consists of two components: the knowledge base and the software.

### Constud can:

- calculate indices of spatial pattern into a database or binary raster file,
- search and weight features and exemplars needed for the most reliable similarity-based predictions,
- predict nominal and numerical variables,
- calculate similarity between an observation and exemplars of a given class,
- store the predictions in a database or as binary raster maps,
- accommodate new data during machine learning.

### Special features

- Computation of local statistics automatically from many raster files.
- Extensive possibilities in the choice of local kernel and sample parameters.
- Possibility to compute local spatial indices using a mask layer.
- Some indices of spatial pattern rarely found in other applications, e.g. gradient strength, stripeness, and mode weighted by inverse distance.
- Data layers, metadata, exemplars, parameters, results of machine learning and predictions are kept in a single database.
- Search for the best solution is a continuous iterative process. Experience obtained during the process is saved in the database as numerical values of actuality.
- Features and cases can be added to and excluded from the training sample without interrupting the learning process.
- The best set of weights found so far for features and exemplars can be used for map generation at any time.
- The option to learn feature and exemplar weights separately for every class of a multinomial feature.
- The system uses leave-one-out cross-validation during the learning process, and a separate validation sample drawn from the training data, to select the best set of features and exemplars found so far.
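To illustrate the leave-one-out idea mentioned in the list above, here is a minimal sketch. It is not Constud's algorithm: it assumes a plain nearest-exemplar predictor with a weighted Manhattan distance, and the names `loo_accuracy` and `feature_weights` are hypothetical. Each observation is predicted from all the others, and the share of correct predictions is returned.

```python
import numpy as np

def loo_accuracy(features, labels, feature_weights):
    """Leave-one-out accuracy of a nearest-exemplar predictor (illustrative only)."""
    X = np.asarray(features, dtype=float)
    y = np.asarray(labels)
    w = np.asarray(feature_weights, dtype=float)
    hits = 0
    for i in range(len(X)):
        # Weighted Manhattan distance from observation i to every observation.
        dist = np.sum(w * np.abs(X - X[i]), axis=1)
        dist[i] = np.inf  # leave observation i out of its own prediction
        nearest = int(np.argmin(dist))
        hits += int(y[nearest] == y[i])
    return hits / len(X)

# Hypothetical data: the first feature separates the classes, the second does not.
X = [[0.0, 5.0], [0.1, 1.0], [1.0, 5.1], [1.1, 1.2]]
y = [0, 0, 1, 1]
print(loo_accuracy(X, y, [1.0, 0.0]))
print(loo_accuracy(X, y, [0.0, 1.0]))
```

Comparing the two calls shows why feature weights matter: the same data give very different leave-one-out results depending on which feature carries the weight.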

### Changes in Constud system

#### Changes from February 15th 2012 (not included in the Constud Tutorial)

Two indices are added to the system:

1) index 17 as the distance to a given category, measured in the same distance units as the pixel sides in the table D_LAYERS;

2) index 18 as the minimum distance to the class boundary, measured in the same distance units as the pixel sides in the table D_LAYERS.

These indices are analogous to indices No 13 and 12 respectively, which are measured in pixels.

Radii in the table EXPL_VAR must be in the same distance units as the coordinates of locations and the pixel sides in the table D_LAYERS. Distances to a given class are recorded in the same units as the coordinates of locations.

#### Previous changes (accounted for in the tutorial)

Short integers (2 bytes) are allowed for the layer of preclassifying polygons, interpolation polygons, and data layers treated as numerical predictors. Byte format is still obligatory for the preclassifier feature of the dependent variable, for data layers treated as nominal, and for values of predictors (including those calculated from integer-format data layers). A numerical dependent variable is treated as a 4-byte real.

The fields [precl_k], [precl_layer_k], [radius_k] of the table [EXPL_VAR] are no longer used in the knowledge base. The corresponding parameters are read from the fields [precl], [precl_layer], [radius]. A single format field [EXPL_VAR].[divisor] is added for converting an integer-format source layer to byte-format indices.

Only the compact format of log tables, in use since 2008, is allowed. The field indicating substitute features in the table [EXPL_VAR] must be [substitute]; the alternative name [replaced_by] is no longer accepted.

Prediction fit is calculated relative to the total sample, not relative to the observations for which a prediction could be calculated. The results differ if the dependent variable cannot be predicted for some observations, e.g. because of the absence of similar exemplars, a unique preclassifier value, or applicable date interval restrictions.
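The difference between the two denominators can be shown with a small example. The sketch below uses the share of correct predictions as a stand-in for the fit measure, which is an assumption, since the text does not name the measure Constud uses. The function names are hypothetical, and `None` marks an observation for which no prediction was possible.

```python
def fit_relative_to_total(observed, predicted):
    """Share of correct predictions among all observations."""
    correct = sum(1 for o, p in zip(observed, predicted)
                  if p is not None and o == p)
    return correct / len(observed)

def fit_relative_to_predicted(observed, predicted):
    """Share of correct predictions among observations that got a prediction."""
    pairs = [(o, p) for o, p in zip(observed, predicted) if p is not None]
    correct = sum(1 for o, p in pairs if o == p)
    return correct / len(pairs)

observed = ["a", "b", "a", "b", "a"]
predicted = ["a", "b", "b", None, "a"]  # the fourth case could not be predicted

print(fit_relative_to_total(observed, predicted))      # 3 correct out of 5
print(fit_relative_to_predicted(observed, predicted))  # 3 correct out of 4
```

With the total sample as the denominator, every unpredictable observation lowers the fit, so the two values coincide only when all observations receive a prediction.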

The main dialog window of Constud allows a common prefix to be added to the beginning of all folder names in the table [D_LAYERS]. The data layers can be located in an FTP folder and used over the internet.

Since 1.09.2010, Constud has supported MS SQL Server databases and 64-bit MS Windows operating systems.

### Requirements for Data Layers

In Constud, data layers are used mainly for extracting local spatial indices. In addition, one user-selected data layer can serve as a pre-classifier, and pre-calculated data layers of explanatory variables (substitute feature layers) and interpolation polygon layers can be used when generating predictive maps.

The data layers of spatial data, pre-classifiers and interpolation polygons used in Constud must meet the following requirements.

- All data layers must have unpacked binary raster format saved by rows and without header (Idrisi32 rst-format).
- Data type of the raster layers should be byte (integer values 0…255) or signed 16-bit (two bytes integers ranging in value from -32768 through 32767).
- All raster files used must be in the same projection and in the same coordinate system.
- The name of each raster file must consist only of digits followed by the extension .rst. The number formed by the digits must not exceed the integer range (2147483647).
- The number of rows and columns must be equal in all rasters, or else both the minimum and maximum coordinates of every map sheet must be registered in the table map_sheets.
- The cell size in raster files can be set by the user, but it must remain constant within a data layer.
- Every data layer must be located in a different subdirectory.
- The data layers must be registered and their metadata recorded in the table [D_LAYERS].
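The file requirements above can be sketched in Python with NumPy. This is an illustrative reader, not part of Constud: the function names are hypothetical, and the number of rows and columns and the data type are passed in by the caller, because a headerless file does not store them.

```python
import re
from pathlib import Path

import numpy as np

INT32_MAX = 2147483647  # largest number allowed in a raster file name

def is_valid_raster_name(name: str) -> bool:
    """True if the name is digits plus '.rst' and the number fits the limit."""
    match = re.fullmatch(r"([0-9]+)\.rst", name)
    return match is not None and int(match.group(1)) <= INT32_MAX

def read_raster(path, rows: int, cols: int, dtype=np.uint8) -> np.ndarray:
    """Read a headerless binary raster stored row by row into a 2-D array."""
    path = Path(path)
    if not is_valid_raster_name(path.name):
        raise ValueError(f"unexpected raster file name: {path.name}")
    data = np.fromfile(path, dtype=dtype)
    if data.size != rows * cols:
        raise ValueError("file size does not match rows x cols")
    return data.reshape(rows, cols)
```

For a signed 16-bit layer, `dtype="<i2"` could be passed instead, assuming little-endian byte order in the file.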

### Spatial Indices in Constud

### Indices calculated from nominal data layers

**Share of a given class** - the share of a given class within the kernel.

**Mode** - the code of the category with the largest share in the kernel.

**Shannon’s index of diversity** - the diversity of pixel classes according to the formula $H = -10 \cdot \sum_i p_i \log_2 p_i$, where $p_i$ is the share of class $i$ in the kernel.
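As a worked example of the formula above, the following sketch computes the index for one kernel with NumPy. The function name is hypothetical, and the kernel is taken to be a plain array of class codes.

```python
import numpy as np

def shannon_diversity(kernel) -> float:
    """H = -10 * sum(p_i * log2(p_i)) over the classes present in the kernel."""
    _, counts = np.unique(np.asarray(kernel), return_counts=True)
    p = counts / counts.sum()  # share of each class; never zero here
    return float(-10.0 * np.sum(p * np.log2(p)))

# Two classes with equal shares give one bit of diversity, scaled by 10.
print(shannon_diversity([[1, 2], [2, 1]]))
```

A kernel with a single class yields zero, and four equally frequent classes yield 20.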

**Lloyd’s index of equitability** - calculated using the formula $10 \cdot H / \lg(s)$, where $H$ is Shannon’s diversity and $s$ is the number of classes within the kernel.

**Dominance** - calculated using the formula $100 \cdot \sum_i p_i^2$, where $p_i$ is the share of class $i$ in the kernel.

**Number of classes** - the number of different classes within the kernel.
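Dominance and the number of classes both follow directly from the class counts in a kernel. A minimal sketch, with hypothetical function names:

```python
import numpy as np

def dominance(kernel) -> float:
    """100 * sum(p_i ** 2), where p_i is the share of class i in the kernel."""
    _, counts = np.unique(np.asarray(kernel), return_counts=True)
    p = counts / counts.sum()
    return float(100.0 * np.sum(p ** 2))

def number_of_classes(kernel) -> int:
    """Number of distinct class codes in the kernel."""
    return int(np.unique(np.asarray(kernel)).size)

kernel = [[1, 1], [2, 2]]
print(dominance(kernel), number_of_classes(kernel))
```

Dominance reaches 100 when a single class fills the kernel and falls as the classes become more evenly mixed.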

**Class adjacency** - the ratio of edges between the pixels of the same class to the number of all edges. Equals 100 when all pixels within the kernel belong to the same category.
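The class adjacency ratio can be sketched as follows. Two things are assumptions here: the kernel is a rectangular array, and an "edge" is a shared side between horizontally or vertically adjacent pixels. Constud's kernel shape and edge definition may differ.

```python
import numpy as np

def class_adjacency(kernel) -> float:
    """Percentage of pixel edges that join two pixels of the same class."""
    k = np.asarray(kernel)
    same_horizontal = k[:, :-1] == k[:, 1:]  # left-right neighbours
    same_vertical = k[:-1, :] == k[1:, :]    # up-down neighbours
    total = same_horizontal.size + same_vertical.size
    if total == 0:
        return 100.0  # a single pixel has no edges to disagree on
    same = int(same_horizontal.sum()) + int(same_vertical.sum())
    return 100.0 * same / total

print(class_adjacency([[1, 1], [2, 2]]))
```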

**Direction of patches** - 0 means one pixel wide vertical stripes, 90 means horizontal one pixel wide stripes. Value 255 marks the absence of patch borders.

**Class proximity** - the ratio of the sum of inverse distances between pixel centres of the same class to the sum of inverse distances of all pixel pairs.

**Share of different class pairs** - the ratio of the number of pixel pairs whose pixels belong to different classes to the number of all pixel pairs within the kernel.
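The share of different class pairs does not require enumerating the pairs: it follows from the class counts. The sketch below returns the plain ratio between 0 and 1; any rescaling Constud applies for storage is not stated in the text and is not reproduced. The function name is hypothetical.

```python
import numpy as np

def share_of_different_pairs(kernel) -> float:
    """Ratio of pixel pairs with different classes to all pixel pairs."""
    _, counts = np.unique(np.asarray(kernel), return_counts=True)
    n = int(counts.sum())
    total_pairs = n * (n - 1) // 2
    if total_pairs == 0:
        return 0.0
    # Pairs within one class: c * (c - 1) / 2 for each class count c.
    same_pairs = int(np.sum(counts * (counts - 1) // 2))
    return (total_pairs - same_pairs) / total_pairs

# Four pixels form 6 pairs; 2 pairs share a class, so 4 of 6 differ.
print(share_of_different_pairs([[1, 1], [2, 2]]))
```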

**Distance weighted mode** - the mode weighted by inverse distance, i.e. the category for which the sum of inverse distances of its pixels from the kernel centre is the greatest. The focal pixel of the kernel is not included.
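A minimal sketch of the distance weighted mode, assuming a rectangular kernel with odd side lengths, Euclidean distances between pixel centres, and the focal pixel at the array centre. The function name is hypothetical, and ties are resolved in favour of the lowest class code, which is a choice of this sketch.

```python
import numpy as np

def distance_weighted_mode(kernel) -> int:
    """Class with the largest sum of inverse distances to the focal pixel."""
    k = np.asarray(kernel)
    rows, cols = k.shape
    rr, cc = np.indices(k.shape)
    dist = np.hypot(rr - rows // 2, cc - cols // 2)
    mask = dist > 0  # excludes the focal pixel
    weights = 1.0 / dist[mask]
    classes, inverse = np.unique(k[mask], return_inverse=True)
    sums = np.bincount(inverse, weights=weights)
    return int(classes[np.argmax(sums)])

# Classes 1 and 2 both occupy four pixels, but class 1 sits closer to the centre.
print(distance_weighted_mode([[2, 1, 2], [1, 9, 1], [2, 1, 2]]))
```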

**Distance to class boundary** - the distance in pixels to the closest class edge from the focal pixel.

**Distance from a given class** - the distance from focal pixel to the closest pixel of a given category in pixels.
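The distance from a given class can be sketched with a brute-force search. The function name and the `pixel_side` parameter are hypothetical; distances are assumed to be Euclidean between pixel centres. With `pixel_side=1.0` the result is in pixels, and with the real pixel side length it is in map units.

```python
import numpy as np

def distance_to_class(layer, row, col, target, pixel_side=1.0):
    """Distance from pixel (row, col) to the nearest pixel of class `target`.

    Returns None when the class does not occur in the layer.
    """
    rr, cc = np.nonzero(np.asarray(layer) == target)
    if rr.size == 0:
        return None
    nearest = np.hypot(rr - row, cc - col).min()
    return float(nearest * pixel_side)

layer = np.zeros((3, 3), dtype=np.uint8)
layer[0, 2] = 5
print(distance_to_class(layer, 2, 2, 5))                 # in pixels
print(distance_to_class(layer, 2, 2, 5, pixel_side=10))  # in map units
```

A brute-force search is fine for an illustration; for whole layers a distance transform would be the usual choice.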

### Indices calculated from numerical data layers

**Share of pixel values above the mean** - the share of pixel values [%] that exceed the mean value within the kernel expressing the asymmetry of the distribution of values.
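A short sketch of this index, with a hypothetical function name. A value far from 50 indicates an asymmetric distribution of pixel values.

```python
import numpy as np

def share_above_mean(kernel) -> float:
    """Percentage of pixel values that exceed the kernel mean."""
    values = np.asarray(kernel, dtype=float)
    return float(100.0 * np.mean(values > values.mean()))

# One large value pulls the mean to 2, so only one pixel of four exceeds it.
print(share_above_mean([[1, 1], [1, 5]]))
```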

**Mean** - the arithmetic mean of pixel values within kernel.

**Standard deviation** - the square root of the mean squared deviation of pixel values from the local mean.

**Median** - pixel value for which the number of smaller pixel values is equal to the number of pixels having higher values.

**Moran’s I of 8 neighbouring pixels** - spatial autocorrelation according to Moran’s I of 8 neighbouring pixels. 0 means maximum possible negative spatial autocorrelation, 100 means absence of spatial autocorrelation and 200 ― maximum positive spatial autocorrelation (similar values adjoin).
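The sketch below computes Moran's I for a rectangular kernel using the 8 neighbours of each pixel (horizontal, vertical and diagonal) with equal weights. The mapping to the 0 to 200 scale as 100 · (I + 1), clipped to the range, is an assumption inferred from the three reference points given above. Both function names are hypothetical.

```python
import numpy as np

def morans_i_queen(kernel):
    """Moran's I with equal weights for the 8 neighbours of each pixel.

    Returns None for a constant kernel, where the statistic is undefined.
    """
    z = np.asarray(kernel, dtype=float)
    z = z - z.mean()
    denom = float(np.sum(z * z))
    if denom == 0.0:
        return None
    rows, cols = z.shape
    cross = 0.0
    n_pairs = 0
    # Four offsets cover every unordered pair of neighbours exactly once.
    for dr, dc in ((0, 1), (1, 0), (1, 1), (1, -1)):
        a = z[0:rows - dr, max(0, -dc):cols - max(0, dc)]
        b = z[dr:rows, max(0, dc):cols - max(0, -dc)]
        cross += float(np.sum(a * b))
        n_pairs += a.size
    return z.size * cross / (n_pairs * denom)

def scaled_moran(i_value) -> float:
    """Assumed mapping of I onto 0..200, with 100 meaning no autocorrelation."""
    return float(np.clip(100.0 * (i_value + 1.0), 0.0, 200.0))

halves = np.array([[0, 0, 1, 1]] * 4)   # similar values adjoin
stripes = np.array([[0, 1, 0, 1]] * 4)  # neighbours mostly differ
print(scaled_moran(morans_i_queen(halves)))
print(scaled_moran(morans_i_queen(stripes)))
```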

**Distance weighted Moran’s I** - the inverse distance weighted Moran’s I having the same scale as the previous index.

**Difference of neighbouring pixels** - the mean difference between adjacent pixels.

**Coefficient of variation** - the ratio of the standard deviation to the mean × 100.

**Gradient direction** - the direction of the inclination of the linear trend surface within the kernel. 255 means no gradient, 50 means increasing values in the direction of the y-axis, 150 means decreasing values in the direction of the y-axis, 100 means increasing values in the direction of the x-axis, and 0 and 200 mean decreasing values in the direction of the x-axis.
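The listed code points are consistent with a linear mapping code = 100 − θ / 1.8, where θ is the direction of increase in degrees measured from the x-axis towards the y-axis. The sketch below fits a plane by least squares and applies that mapping. The linear interpolation between the listed points, the choice of the row index as the y coordinate, and the function name are all assumptions.

```python
import math

import numpy as np

def gradient_direction(kernel, tol=1e-9) -> int:
    """Direction code of the fitted linear trend surface (255 = no gradient)."""
    k = np.asarray(kernel, dtype=float)
    yy, xx = np.indices(k.shape)
    # Least-squares plane: value = a + bx * x + by * y
    design = np.column_stack([np.ones(k.size), xx.ravel(), yy.ravel()])
    coef, *_ = np.linalg.lstsq(design, k.ravel(), rcond=None)
    bx, by = float(coef[1]), float(coef[2])
    if math.hypot(bx, by) < tol:
        return 255
    theta = math.degrees(math.atan2(by, bx))  # direction of increasing values
    return int(round(100.0 - theta / 1.8))

yy, xx = np.indices((5, 5))
print(gradient_direction(xx))   # values increase along x
print(gradient_direction(yy))   # values increase along y
print(gradient_direction(-yy))  # values decrease along y
```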

**Difference of border to centre** - difference between the mean value of pixels within half radius and the mean value of pixels located between half and full radius of the kernel.
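A minimal sketch of this index, assuming a circular kernel cut from a square array with the focal pixel at the centre. The sign convention (inner mean minus outer mean), the treatment of pixels lying exactly on a boundary, and the function name are assumptions.

```python
import numpy as np

def border_centre_difference(kernel, radius) -> float:
    """Mean within half the radius minus mean between half and full radius."""
    k = np.asarray(kernel, dtype=float)
    rows, cols = k.shape
    rr, cc = np.indices(k.shape)
    dist = np.hypot(rr - rows // 2, cc - cols // 2)
    inner = dist <= radius / 2
    outer = (dist > radius / 2) & (dist <= radius)
    return float(k[inner].mean() - k[outer].mean())

# A bright cross in the centre of a dark 5 x 5 kernel.
k = np.zeros((5, 5))
k[2, 2] = k[1, 2] = k[3, 2] = k[2, 1] = k[2, 3] = 10
print(border_centre_difference(k, radius=2))
```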

**Minimum** - the minimum value within kernel.

**Maximum** - the maximum value within kernel.

**Factor of kurtosis** - calculated using a formula (not reproduced in this text) in which $x_i$ is a pixel value, $\bar{x}$ is the mean of pixel values, $n$ is the number of pixels within the kernel, and $\sigma$ is the standard deviation of pixel values. Greater values appear where areas of different brightness are close together.

**Gradient smoothness** - homogeneity of neighbouring pixels. Calculated using a formula (not reproduced in this text) in which $x_i$ and $x_j$ are the values of neighbouring pixels in horizontal, vertical and diagonal directions and $n$ is the number of pixels within the kernel.

**Difference from the mean** - difference between the focal pixel value and the arithmetic mean of all pixels within kernel.

**Gradient intensity** - the angle of inclination of the linear trend surface of pixel values within kernel. Difference of pixels within kernel to mean pixel value is multiplied by pixel distance in gradient direction. Values range from 0 to 200, 0 means zero gradient (sum of weighted pixel distances on gradient equals the sum of un-weighted pixel values on gradient). Value 10 means the pixel values on a gradient change by one when distance is increased by one pixel.

**Distance weighted mean** - the mean of pixel values weighted by inverse distance from the focal pixel.
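A minimal sketch of the distance weighted mean, assuming a rectangular kernel with the focal pixel at the centre and Euclidean distances between pixel centres. The focal pixel itself is left out because its inverse distance is undefined. That exclusion and the function name are assumptions.

```python
import numpy as np

def distance_weighted_mean(kernel) -> float:
    """Mean of pixel values weighted by inverse distance from the focal pixel."""
    k = np.asarray(kernel, dtype=float)
    rows, cols = k.shape
    rr, cc = np.indices(k.shape)
    dist = np.hypot(rr - rows // 2, cc - cols // 2)
    mask = dist > 0  # the focal pixel has no finite weight
    weights = 1.0 / dist[mask]
    return float(np.sum(weights * k[mask]) / np.sum(weights))

# Side neighbours (distance 1) count more than corner neighbours (distance sqrt 2).
print(distance_weighted_mean([[0, 4, 0], [4, 99, 4], [0, 4, 0]]))
```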

**Stripeness** - calculated first as the largest difference between the average of 9 pixels and 3 pixels in a line in four directions (north to south, east to west, north-west to south-east, and north-east to south-west) around every pixel and thereafter as the mean of all partial differences within the sample in direction of the maximum difference. The stripeness = 0 if no directional structures are present.