Next: Functions and Variables for descriptive statistics, Previous: Introduction to descriptive, Up: descriptive [Contents][Index]
Builds a sample from a table of absolute frequencies. The input table can be a matrix or a list of lists, all of them of equal size. The number of columns or the length of the lists must be greater than 1. The last element of each row or list is interpreted as the absolute frequency. The output is always a sample in matrix form.
Examples:
Univariate frequency table.
(%i1) load ("descriptive")$ (%i2) sam1: build_sample([[6,1], [j,2], [2,1]]); [ 6 ] [ ] [ j ] (%o2) [ ] [ j ] [ ] [ 2 ] (%i3) mean(sam1); 2 j + 8 (%o3) [-------] 4 (%i4) barsplot(sam1) $
Multivariate frequency table.
(%i1) load ("descriptive")$ (%i2) sam2: build_sample([[6,3,1], [5,6,2], [u,2,1],[6,8,2]]) ; [ 6 3 ] [ ] [ 5 6 ] [ ] [ 5 6 ] (%o2) [ ] [ u 2 ] [ ] [ 6 8 ] [ ] [ 6 8 ] (%i3) cov(sam2); [ 2 2 ] [ u + 158 (u + 28) 2 u + 174 11 (u + 28) ] [ -------- - --------- --------- - ----------- ] (%o3) [ 6 36 6 12 ] [ ] [ 2 u + 174 11 (u + 28) 21 ] [ --------- - ----------- -- ] [ 6 12 4 ] (%i4) barsplot(sam2, grouping=stacked) $
The first argument of continuous_freq
must be a list
or 1-dimensional array (as created by make_array
) of numbers.
Divides the range in intervals and counts how many values
are inside them. The second argument is optional and either
equals the number of classes we want, 10 by default, or
equals a list containing the class limits and the number of
classes we want, or a list containing only the limits.
If sample values are all equal, this function returns only one class of amplitude 2.
Examples:
Optional argument indicates the number of classes we want.
The first list in the output contains the interval limits, and
the second the corresponding counts: there are 16 digits inside
the interval [0, 1.8]
, 24 digits in (1.8, 3.6]
, and so on.
(%i1) load ("descriptive")$ (%i2) s1 : read_list (file_search ("pidigits.data"))$ (%i3) continuous_freq (s1, 5); (%o3) [[0, 1.8, 3.6, 5.4, 7.2, 9.0], [16, 24, 18, 17, 25]]
Optional argument indicates we want 7 classes with limits -2 and 12:
(%i1) load ("descriptive")$ (%i2) s1 : read_list (file_search ("pidigits.data"))$ (%i3) continuous_freq (s1, [-2,12,7]); (%o3) [[- 2, 0, 2, 4, 6, 8, 10, 12], [8, 20, 22, 17, 20, 13, 0]]
Optional argument indicates we want the default number of classes with limits -2 and 12:
(%i1) load ("descriptive")$ (%i2) s1 : read_list (file_search ("pidigits.data"))$ (%i3) continuous_freq (s1, [-2,12]); 3 4 11 18 32 39 46 53 (%o3) [[- 2, - -, -, --, --, 5, --, --, --, --, 12], 5 5 5 5 5 5 5 5 [0, 8, 20, 12, 18, 9, 8, 25, 0, 0]]
The first argument may be an array.
(%i1) load ("descriptive")$ (%i2) s1 : read_list (file_search ("pidigits.data"))$ (%i3) a1 : make_array (fixnum, length (s1)) $ (%i4) fillarray (a1, s1); (%o4) {Lisp Array: #(3 1 4 1 5 9 2 6 5 3 5 8 9 7 9 3 2 3 8 4 6 2 6 4 3 3 8 3 2 7 9 \ 5 0 2 8 8 4 1 9 7 1 6 9 3 9 9 3 7 5 1 0 5 8 2 0 9 7 4 9 4 4 5 9 2 3 0 7 8 1 6 4 0 6 2 8 6 2 0 8 9 9 8 6 2 8 0 3 4 8 2 5 3 4 2 \ 1 1 7 0 6 7)} (%i5) continuous_freq (a1); 9 9 27 18 9 27 63 36 81 (%o5) [[0, --, -, --, --, -, --, --, --, --, 9], 10 5 10 5 2 5 10 5 10 [8, 8, 12, 12, 10, 8, 9, 8, 12, 13]]
Counts absolute frequencies in discrete samples, both numeric and categorical. Its unique argument is a list,
or 1-dimensional array (as created by make_array
).
(%i1) load ("descriptive")$ (%i2) s1 : read_list (file_search ("pidigits.data"))$
(%i3) discrete_freq (s1); (%o3) [[0, 1, 2, 3, 4, 5, 6, 7, 8, 9], [8, 8, 12, 12, 10, 8, 9, 8, 12, 13]]
The first list gives the sample values and the second their absolute frequencies. Commands ? col
and ? transpose
should help you to understand the last input.
The argument may be an array.
(%i1) load ("descriptive")$ (%i2) s1 : read_list (file_search ("pidigits.data"))$ (%i3) a1 : make_array (fixnum, length (s1)) $ (%i4) fillarray (a1, s1); (%o4) {Lisp Array: #(3 1 4 1 5 9 2 6 5 3 5 8 9 7 9 3 2 3 8 4 6 2 6 4 3 3 8 3 2 7 9 \ 5 0 2 8 8 4 1 9 7 1 6 9 3 9 9 3 7 5 1 0 5 8 2 0 9 7 4 9 4 4 5 9 2 3 0 7 8 1 6 4 0 6 2 8 6 2 0 8 9 9 8 6 2 8 0 3 4 8 2 5 3 4 2 \ 1 1 7 0 6 7)} (%i5) discrete_freq (a1); (%o5) [[0, 1, 2, 3, 4, 5, 6, 7, 8, 9], [8, 8, 12, 12, 10, 8, 9, 8, 12, 13]]
Subtracts to each element of the list the sample mean and divides
the result by the standard deviation. When the input is a matrix,
standardize
subtracts to each row the multivariate mean, and then
divides each component by the corresponding standard deviation.
This is a sort of variant of the Maxima submatrix
function.
The first argument is the data matrix, the second is a predicate function
and optional additional arguments are the numbers of the columns to be taken.
Its behaviour is better understood with examples.
These are multivariate records in which the wind speed
in the first meteorological station were greater than 18.
See that in the lambda expression the i-th component is
referred to as v[i]
.
(%i1) load ("descriptive")$ (%i2) s2 : read_matrix (file_search ("wind.data"))$
(%i3) subsample (s2, lambda([v], v[1] > 18)); [ 19.38 15.37 15.12 23.09 25.25 ] [ ] [ 18.29 18.66 19.08 26.08 27.63 ] (%o3) [ ] [ 20.25 21.46 19.95 27.71 23.38 ] [ ] [ 18.79 18.96 14.46 26.38 21.84 ]
In the following example, we request only the first, second and fifth components of those records with wind speeds greater or equal than 16 in station number 1 and less than 25 knots in station number 4. The sample contains only data from stations 1, 2 and 5. In this case, the predicate function is defined as an ordinary Maxima function.
(%i1) load ("descriptive")$ (%i2) s2 : read_matrix (file_search ("wind.data"))$ (%i3) g(x):= x[1] >= 16 and x[4] < 25$
(%i4) subsample (s2, g, 1, 2, 5); [ 19.38 15.37 25.25 ] [ ] [ 17.33 14.67 19.58 ] (%o4) [ ] [ 16.92 13.21 21.21 ] [ ] [ 17.25 18.46 23.87 ]
Here is an example with the categorical variables of biomed.data
.
We want the records corresponding to those patients in group B
who are older than 38 years.
(%i1) load ("descriptive")$ (%i2) s3 : read_matrix (file_search ("biomed.data"))$ (%i3) h(u):= u[1] = B and u[2] > 38 $
(%i4) subsample (s3, h); [ B 39 28.0 102.3 17.1 146 ] [ ] [ B 39 21.0 92.4 10.3 197 ] [ ] [ B 39 23.0 111.5 10.0 133 ] [ ] [ B 39 26.0 92.6 12.3 196 ] (%o4) [ ] [ B 39 25.0 98.7 10.0 174 ] [ ] [ B 39 21.0 93.2 5.9 181 ] [ ] [ B 39 18.0 95.0 11.3 66 ] [ ] [ B 39 39.0 88.5 7.6 168 ]
Probably, the statistical analysis will involve only the blood measures,
(%i1) load ("descriptive")$ (%i2) s3 : read_matrix (file_search ("biomed.data"))$
(%i3) subsample (s3, lambda([v], v[1] = B and v[2] > 38), 3, 4, 5, 6); [ 28.0 102.3 17.1 146 ] [ ] [ 21.0 92.4 10.3 197 ] [ ] [ 23.0 111.5 10.0 133 ] [ ] [ 26.0 92.6 12.3 196 ] (%o3) [ ] [ 25.0 98.7 10.0 174 ] [ ] [ 21.0 93.2 5.9 181 ] [ ] [ 18.0 95.0 11.3 66 ] [ ] [ 39.0 88.5 7.6 168 ]
This is the multivariate mean of s3
,
(%i1) load ("descriptive")$ (%i2) s3 : read_matrix (file_search ("biomed.data"))$
(%i3) mean (s3); 65 B + 35 A 317 6 NA + 8144.999999999999 (%o3) [-----------, ---, 87.178, ------------------------, 100 10 100 3 NA + 19587 18.123, ------------] 100
Here, the first component is meaningless, since A
and B
are categorical, the second component is the mean age of individuals in rational form, and the fourth and last values exhibit some strange behaviour. This is because symbol NA
is used here to indicate non available data, and the two means are nonsense. A possible solution would be to take out from the matrix those rows with NA
symbols, although this deserves some loss of information.
(%i1) load ("descriptive")$ (%i2) s3 : read_matrix (file_search ("biomed.data"))$ (%i3) g(v):= v[4] # NA and v[6] # NA $
(%i4) mean (subsample (s3, g, 3, 4, 5, 6)); (%o4) [79.4923076923077, 86.2032967032967, 16.93186813186813, 2514 ----] 13
Transforms the sample matrix, where each column is called according to varlist, following expressions in exprlist.
Examples:
The second argument assigns names to the three columns. With these names, a list of expressions define the transformation of the sample.
(%i1) load ("descriptive")$ (%i2) data: matrix([3,2,7],[3,7,2],[8,2,4],[5,2,4]) $
(%i3) transform_sample(data, [a,b,c], [c, a*b, log(a)]); [ 7 6 log(3) ] [ ] [ 2 21 log(3) ] (%o3) [ ] [ 4 16 log(8) ] [ ] [ 4 10 log(5) ]
Add a constant column and remove the third variable.
(%i1) load ("descriptive")$ (%i2) data: matrix([3,2,7],[3,7,2],[8,2,4],[5,2,4]) $ (%i3) transform_sample(data, [a,b,c], [makelist(1,k,length(data)),a,b]);
[ 1 3 2 ] [ ] [ 1 3 7 ] (%o3) [ ] [ 1 8 2 ] [ ] [ 1 5 2 ]
Next: Functions and Variables for descriptive statistics, Previous: Introduction to descriptive, Up: descriptive [Contents][Index]