imsegm.classification module

Supporting file to create and set parameters for scikit-learn classifiers, plus some preprocessing functions that support classification.

class imsegm.classification.CrossValidate(nb_samples, nb_hold_out, rand_seed=None, ignore_overflow=0.01)

Bases: object

Cross-validator generator. The data is split repeatedly into a train set and a test set: every split holds out nb_hold_out samples for testing, and successive splits cover different parts of the sample indexes.

Parameters

nb_samples (int) – total number of samples
nb_hold_out (int|float) – number of samples held out
rand_seed (int|None) – seed for the random number generator
ignore_overflow (float) – tolerance while dividing the dataset into folds

Examples

>>> # balanced split
>>> cv = CrossValidate(6, 3, rand_seed=False)
>>> cv.indexes
[0, 1, 2, 3, 4, 5]
>>> len(cv)
2
>>> list(cv)  
[([3, 4, 5], [0, 1, 2]),
 ([0, 1, 2], [3, 4, 5])]
>>> [(len(tr), len(ts)) for tr, ts in CrossValidate(340, 0.41)]
[(201, 139), (201, 139), (201, 139)]

>>> # not rounded split
>>> cv = CrossValidate(7, 3, rand_seed=0)
>>> list(cv)  
[([3, 0, 5, 4], [6, 2, 1]),
 ([6, 2, 1, 4], [3, 0, 5]),
 ([1, 3, 0, 5], [4, 6, 2])]
>>> len(cv)
3
>>> cv.indexes
[6, 2, 1, 3, 0, 5, 4]

>>> # larger test than train
>>> cv = CrossValidate(7, 5, rand_seed=0)
>>> list(cv)  
[([6, 2], [1, 3, 0, 5, 4]),
 ([1, 3], [6, 2, 0, 5, 4]),
 ([0, 5], [6, 2, 1, 3, 4]),
 ([4, 6], [2, 1, 3, 0, 5])]
>>> [(len(tr), len(ts)) for tr, ts in CrossValidate(340, 0.55)]
[(153, 187), (153, 187), (153, 187)]

>>> # impact of tolerance
>>> len(CrossValidate(340, 0.33, ignore_overflow=0.0))
4
>>> len(CrossValidate(340, 0.33, ignore_overflow=0.05))
3

>>> [(len(tr), len(ts)) for tr, ts in CrossValidate(4651, 0.25, ignore_overflow=0.)]
[(3488, 1163), (3488, 1163), (3488, 1163), (3488, 1163)]
>>> [(len(tr), len(ts)) for tr, ts in CrossValidate(4651, 0.25, ignore_overflow=1e-2)]
[(3488, 1163), (3488, 1163), (3488, 1163), (3489, 1162)]

constructor

Parameters

nb_samples (int) – total number of samples
nb_hold_out (int|float) – number (int) or fraction (float) of samples to hold out
rand_seed (int|None) – random seed for shuffling
ignore_overflow (float) – tolerance while dividing dataset to folds
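
The fold layout in the examples above can be read as a window over the shuffled indexes that wraps around at the end. The following is a minimal sketch of that reading, not the library's implementation: `wraparound_folds` is a hypothetical helper, it does not model `ignore_overflow`, and it only covers the case where the hold-out is no larger than the training part.

```python
def wraparound_folds(indexes, nb_hold_out):
    """Return (train, test) pairs using a window that wraps around the end."""
    nb_samples = len(indexes)
    folds = []
    for start in range(0, nb_samples, nb_hold_out):
        # take nb_hold_out consecutive positions, wrapping past the last index
        test = [indexes[(start + i) % nb_samples] for i in range(nb_hold_out)]
        held_out = set(test)
        # everything else keeps its shuffled order and forms the training set
        train = [idx for idx in indexes if idx not in held_out]
        folds.append((train, test))
    return folds


# shuffled order copied from the "not rounded split" example (rand_seed=0)
folds = wraparound_folds([6, 2, 1, 3, 0, 5, 4], 3)
for train, test in folds:
    print(train, test)
```

For this input the sketch gives the same three folds as the documented output; the last window wraps around and reuses samples 6 and 2 in the test set.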

__steps()

adjust the steps of this iterator with respect to the tolerance (tol_balance)

Return list(int): indexes of steps

class imsegm.classification.CrossValidateGroups(set_sizes, nb_hold_out, rand_seed=None, ignore_overflow=0.01)

Bases: imsegm.classification.CrossValidate

Cross-validator generator for grouped samples. The sets are split repeatedly into train and test: every split holds out nb_hold_out whole sets, so within one split all samples of a set stay on the same side.

Parameters

set_sizes (list(int)) – number of samples in each set
nb_hold_out (int|float) – number of sets held out
rand_seed (int|None) – seed for the random number generator
ignore_overflow (float) – tolerance while dividing the dataset into folds

Examples

>>> # balanced split
>>> cv = CrossValidateGroups([2, 3, 2, 3], 2, rand_seed=False)
>>> cv.set_indexes
[[0, 1], [2, 3, 4], [5, 6], [7, 8, 9]]
>>> len(cv)
2
>>> list(cv)  
[([5, 6, 7, 8, 9], [0, 1, 2, 3, 4]),
 ([0, 1, 2, 3, 4], [5, 6, 7, 8, 9])]
>>> [(len(tr), len(ts)) for tr, ts in CrossValidateGroups([7] * 340, 0.41)]
[(1407, 973), (1407, 973), (1407, 973)]

>>> # unbalanced split
>>> cv = CrossValidateGroups([2, 2, 1, 2, 1], 2, rand_seed=0)
>>> cv.set_indexes
[[0, 1], [2, 3], [4], [5, 6], [7]]
>>> list(cv)  
[([2, 3, 5, 6, 7], [4, 0, 1]),
 ([4, 0, 1, 7], [2, 3, 5, 6]),
 ([0, 1, 2, 3, 5, 6], [7, 4])]
>>> len(cv)
3
>>> cv.indexes
[2, 0, 1, 3, 4]

>>> # larger test than train
>>> cv = CrossValidateGroups([2, 2, 1, 2, 1, 1], 4, rand_seed=0)
>>> list(cv)  
[([8, 4], [2, 3, 5, 6, 0, 1, 7]),
 ([2, 3, 5, 6], [8, 4, 0, 1, 7]),
 ([0, 1, 7], [8, 4, 2, 3, 5, 6])]
>>> [(len(tr), len(ts)) for tr, ts in CrossValidateGroups([7] * 340, 0.55)]
[(1071, 1309), (1071, 1309), (1071, 1309)]

constructor

Parameters

set_sizes (list(int)) – number of samples in each set
nb_hold_out (int|float) – number (int) or fraction (float) of sets to hold out
rand_seed (int|None) – random seed for shuffling
ignore_overflow (float) – tolerance while dividing dataset to folds

__iter_indexes(sets)

return the unrolled (flattened) sample indexes of the selected sets

Parameters: sets (list(int)) – selection of indexes
Return list(int)
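
How set sizes relate to `set_indexes`, and how a selection of sets unrolls into sample indexes, can be sketched as follows. The helper names `build_set_indexes` and `unroll_sets` are hypothetical; this is an illustration of the idea under those assumptions, not the library code.

```python
from itertools import accumulate


def build_set_indexes(set_sizes):
    """Turn set sizes into consecutive sample indexes, one list per set."""
    ends = list(accumulate(set_sizes))
    starts = [0] + ends[:-1]
    return [list(range(start, end)) for start, end in zip(starts, ends)]


def unroll_sets(set_indexes, selected_sets):
    """Concatenate the sample indexes of the selected sets, in the given order."""
    return [idx for set_id in selected_sets for idx in set_indexes[set_id]]


# sizes taken from the "unbalanced split" example
set_indexes = build_set_indexes([2, 2, 1, 2, 1])
test_samples = unroll_sets(set_indexes, [2, 0])      # sets held out in the first fold
train_samples = unroll_sets(set_indexes, [1, 3, 4])  # remaining sets
print(set_indexes)
print(train_samples, test_samples)
```

With the set order `[2, 0, 1, 3, 4]` from the example, holding out the first two sets gives the first documented fold.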

class imsegm.classification.HoldOut(nb_samples, hold_out, rand_seed=0)

Bases: object

Hold-out cross-validator generator. In the hold-out, the data is split only once into a train set and a test set. Unlike in other cross-validation schemes, the hold-out consists of only one iteration.

Parameters

nb_samples (int) – total number of samples
hold_out (int) – index where the test set starts
rand_seed (int|None) – seed for the random number generator

Example

>>> ho = HoldOut(10, 7, rand_seed=None)
>>> len(ho)
1
>>> list(ho)
[([0, 1, 2, 3, 4, 5, 6], [7, 8, 9])]
>>> ho = HoldOut(10, 7, rand_seed=0)
>>> list(ho)
[([2, 8, 4, 9, 1, 6, 7], [3, 0, 5])]

constructor

Parameters

nb_samples (int) – total number of samples
hold_out (int) – index where the test starts
rand_seed (obj) – Seed for the random number generator.
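
A single hold-out split amounts to cutting the (optionally shuffled) index list at `hold_out`. Below is a minimal sketch; `hold_out_split` is a hypothetical function, and it shuffles with Python's `random` module, so a seeded order will not match the seeded output documented above.

```python
import random


def hold_out_split(nb_samples, hold_out, rand_seed=None):
    """Split indexes once: the first hold_out go to train, the rest to test."""
    indexes = list(range(nb_samples))
    if rand_seed is not None:
        # shuffle reproducibly; without a seed the natural order is kept
        random.Random(rand_seed).shuffle(indexes)
    return indexes[:hold_out], indexes[hold_out:]


train, test = hold_out_split(10, 7)
print(train, test)  # [0, 1, 2, 3, 4, 5, 6] [7, 8, 9]
```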

imsegm.classification.balance_dataset_by_(features, labels, balance_type='random', min_samples=None)

balance the number of training examples per class by one of several methods

Parameters

features (ndarray) – features in dimension nb_samples x nb_features
labels (list(int)) – annotation for samples
balance_type (str) – type of balancing dataset
min_samples (int|None) – if None take the smallest class

Return tuple(ndarray,ndarray)

>>> np.random.seed(0)
>>> fts, lbs = balance_dataset_by_(np.random.random((25, 3)), np.random.randint(0, 2, 25))
>>> fts.shape
(24, 3)
>>> lbs
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
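
The 'random' balancing can be pictured as drawing the same number of samples from every class. The sketch below assumes plain Python lists instead of arrays; `balance_by_random` and its `seed` argument are hypothetical and only illustrate the idea.

```python
import random
from collections import defaultdict


def balance_by_random(features, labels, min_samples=None, seed=0):
    """Randomly keep the same number of samples for every label."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for row, label in zip(features, labels):
        by_label[label].append(row)
    if min_samples is None:
        # default: shrink every class to the size of the smallest one
        min_samples = min(len(rows) for rows in by_label.values())
    out_features, out_labels = [], []
    for label in sorted(by_label):
        rows = by_label[label]
        picked = rng.sample(rows, min(min_samples, len(rows)))
        out_features.extend(picked)
        out_labels.extend([label] * len(picked))
    return out_features, out_labels


features = [[0.1], [0.2], [0.3], [0.4], [0.5]]
labels = [0, 0, 0, 1, 1]
fts, lbs = balance_by_random(features, labels)
print(lbs)  # [0, 0, 1, 1]
```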

imsegm.classification.compose_dict_label_features(features, labels)

convert a vector of features and related labels to a dictionary of features where the keys are the labels

Parameters

features (ndarray) – features in dimension nb_samples x nb_features
labels (list(int)) – annotation for samples

Return dict(int,ndarray)

{int: np.array<nb, nb_features>}
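
Grouping features by label, and the inverse step that convert_dict_label_features_2_vectors describes, can be sketched with plain lists. The function names below are hypothetical, and the label order of the flattened result (sorted by label) is an assumption.

```python
from collections import defaultdict


def group_features_by_label(features, labels):
    """Collect the feature rows of each label under that label's key."""
    grouped = defaultdict(list)
    for row, label in zip(features, labels):
        grouped[label].append(row)
    return dict(grouped)


def flatten_label_features(dict_features):
    """Inverse step: one list of rows plus the matching list of labels."""
    features, labels = [], []
    for label in sorted(dict_features):
        rows = dict_features[label]
        features.extend(rows)
        labels.extend([label] * len(rows))
    return features, labels


grouped = group_features_by_label([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]], [1, 0, 1])
features, labels = flatten_label_features(grouped)
print(grouped)
print(features, labels)
```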

imsegm.classification.compute_classif_metrics(y_true, y_pred, metric_averages=('macro', 'weighted'))

compute standard metrics for multi-class classification

Parameters

y_true (list(int)) – true labels
y_pred (list(int)) – predicted labels
metric_averages (str|list(str)) – averaging types of the computed metrics

Return dict(float)

>>> np.random.seed(0)
>>> y_true = np.random.randint(0, 3, 25) * 2
>>> y_pred = np.random.randint(0, 2, 25) * 2
>>> d = compute_classif_metrics(y_true, y_true)
>>> d['accuracy']
1.0
>>> d['confusion']
[[10, 0, 0], [0, 10, 0], [0, 0, 5]]
>>> d = compute_classif_metrics(y_true, y_pred)
>>> d['accuracy']  
0.32...
>>> d['confusion']
[[3, 7, 0], [5, 5, 0], [1, 4, 0]]
>>> d = compute_classif_metrics(y_pred, y_pred)
>>> d['accuracy']
1.0
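
To make the 'accuracy' and 'confusion' entries concrete, here is a small pure-Python sketch of both quantities. It assumes rows index the true label and columns the predicted label, which is consistent with the outputs above; `confusion_and_accuracy` is a hypothetical helper and not part of the module.

```python
def confusion_and_accuracy(y_true, y_pred):
    """Count (true, predicted) pairs and the share of exact matches."""
    labels = sorted(set(y_true) | set(y_pred))
    position = {label: i for i, label in enumerate(labels)}
    confusion = [[0] * len(labels) for _ in labels]
    for truth, pred in zip(y_true, y_pred):
        confusion[position[truth]][position[pred]] += 1
    # correct predictions sit on the diagonal
    correct = sum(confusion[i][i] for i in range(len(labels)))
    return confusion, correct / len(y_true)


y_true = [0, 0, 2, 2, 4]
y_pred = [0, 2, 2, 2, 0]
confusion, accuracy = confusion_and_accuracy(y_true, y_pred)
print(confusion)  # [[1, 1, 0], [0, 2, 0], [1, 0, 0]]
print(accuracy)   # 0.6
```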

imsegm.classification.compute_classif_stat_segm_annot(annot_segm_name, drop_labels=None, relabel=False)

compute classification statistics between annotation and segmentation

Parameters

annot_segm_name (tuple(ndarray,ndarray,str)) – annotation, segmentation and name
drop_labels (list(int)) – labels to be ignored
relabel (bool) – whether to relabel

Returns

>>> np.random.seed(0)
>>> annot = np.random.randint(0, 2, (5, 10))
>>> segm = np.random.randint(0, 2, (5, 10))
>>> d = compute_classif_stat_segm_annot((annot, annot, 'ttt'), relabel=True, drop_labels=[5])
>>> d['(FP+FN)/(TP+FN)']  
0.0
>>> d['(TP+FP)/(TP+FN)']  
1.0
>>> d = compute_classif_stat_segm_annot((annot, segm, 'ttt'), relabel=True, drop_labels=[5])
>>> d['(FP+FN)/(TP+FN)']  
0.846...
>>> d['(TP+FP)/(TP+FN)']  
1.153...
>>> d = compute_classif_stat_segm_annot((annot, segm + 1, 'ttt'), relabel=False, drop_labels=[0])
>>> d['confusion']
[[13, 17], [0, 0]]

imsegm.classification.compute_metric_fpfn_tpfn(annot, segm, label_positive=None)

compute measure (FP + FN) / (TP + FN)

Parameters

annot (ndarray) – annotation
segm (ndarray) – segmentation
label_positive (int) – indexes of positive labels

Return float

>>> np.random.seed(0)
>>> annot = np.random.randint(0, 2, (50, 75)) * 3
>>> segm = np.random.randint(0, 2, (50, 75)) * 3
>>> compute_metric_fpfn_tpfn(annot, segm)  
1.02...
>>> compute_metric_fpfn_tpfn(annot, annot)
0.0
>>> compute_metric_fpfn_tpfn(annot, np.ones((50, 75)))
nan

imsegm.classification.compute_metric_tpfp_tpfn(annot, segm, label_positive=None)

compute measure (TP + FP) / (TP + FN)

Parameters

annot (ndarray) – annotation
segm (ndarray) – segmentation
label_positive (int) – indexes of positive labels

Return float

>>> np.random.seed(0)
>>> annot = np.random.randint(0, 2, (50, 75)) * 3
>>> segm = np.random.randint(0, 2, (50, 75)) * 3
>>> compute_metric_tpfp_tpfn(annot, segm)  
1.03...
>>> compute_metric_tpfp_tpfn(annot, annot)
1.0
>>> compute_metric_tpfp_tpfn(annot, np.ones((50, 75)))
nan
>>> compute_metric_tpfp_tpfn(annot, np.zeros((50, 75)))
0.0
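
Both ratios follow directly from the four confusion counts. The sketch below works on flat label lists with an explicitly given positive label and uses the textbook definitions (a false positive is a sample segmented as positive but annotated as negative); how the library picks the positive label by default is not modelled, and the helper names are hypothetical.

```python
import math


def count_tp_tn_fp_fn(annot, segm, label_positive):
    """Count true/false positives/negatives for one positive label."""
    tp = tn = fp = fn = 0
    for a, s in zip(annot, segm):
        a_pos, s_pos = (a == label_positive), (s == label_positive)
        if a_pos and s_pos:
            tp += 1
        elif not a_pos and not s_pos:
            tn += 1
        elif s_pos:
            fp += 1
        else:
            fn += 1
    return tp, tn, fp, fn


def safe_ratio(numerator, denominator):
    """Return nan instead of raising when nothing is annotated as positive."""
    return numerator / denominator if denominator else math.nan


annot = [3, 3, 3, 0, 0, 0]
segm = [3, 3, 0, 3, 0, 0]
tp, tn, fp, fn = count_tp_tn_fp_fn(annot, segm, label_positive=3)
fpfn_tpfn = safe_ratio(fp + fn, tp + fn)  # errors relative to annotated positives
tpfp_tpfn = safe_ratio(tp + fp, tp + fn)  # segmented vs. annotated positive size
print((tp, tn, fp, fn), fpfn_tpfn, tpfp_tpfn)
```

A perfect segmentation gives 0.0 for the first ratio and 1.0 for the second, matching the documented values for `(annot, annot)`.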

imsegm.classification.compute_stat_per_image(segms, annots, names=None, nb_workers=2, drop_labels=None, relabel=False)

compute statistics over multiple segmentations with annotations

Parameters

segms (list(ndarray)) – segmentations
annots (list(ndarray)) – annotations
names (list(str)) – list of names
drop_labels (list(int)) – labels to be ignored
relabel (bool) – whether to relabel
nb_workers (int) – number of jobs running in parallel

Return DF

>>> np.random.seed(0)
>>> img_true = np.random.randint(0, 3, (50, 100))
>>> img_pred = np.random.randint(0, 2, (50, 100))
>>> df = compute_stat_per_image([img_true], [img_true], nb_workers=2, relabel=True)
>>> from pprint import pprint
>>> pprint(pd.Series(df.iloc[0]).sort_index().to_dict())  
{'ARS': 1.0,
 'accuracy': 1.0,
 'confusion': [[1672, 0, 0], [0, 1682, 0], [0, 0, 1646]],
 'f1_macro': 1.0,
 'precision_macro': 1.0,
 'recall_macro': 1.0,
 'support_macro': None}
>>> df = compute_stat_per_image([img_true], [img_pred], drop_labels=[-1])
>>> pd.Series(df.round(4).iloc[0]).sort_index()  
ARS                                                       0.0002
accuracy                                                  0.3384
confusion          [[836, 826, 770], [836, 856, 876], [0, 0, 0]]
f1_macro                                                  0.2701
precision_macro                                           0.3363
recall_macro                                              0.2257
support_macro                                               None
Name: 0, dtype: object

imsegm.classification.compute_tp_tn_fp_fn(annot, segm, label_positive=None)

compute measure TruePositive, TrueNegative, FalsePositive, FalseNegative

Parameters

annot (ndarray) – annotation
segm (ndarray) – segmentation
label_positive (int) – indexes of positive labels

Return tuple(float,float,float,float)

>>> np.random.seed(0)
>>> annot = np.random.randint(0, 2, (5, 7)) * 9
>>> segm = np.random.randint(0, 2, (5, 7)) * 9
>>> annot - segm
array([[-9,  9,  0, -9,  9,  9,  0],
       [ 9,  0,  0,  0, -9, -9,  9],
       [-9,  0, -9, -9, -9,  0,  0],
       [ 0,  9,  0, -9,  0,  9,  0],
       [ 9, -9,  9,  0,  9,  0,  9]])
>>> compute_tp_tn_fp_fn(annot, annot)
(20, 15, 0, 0)
>>> compute_tp_tn_fp_fn(annot, segm)
(9, 5, 11, 10)
>>> compute_tp_tn_fp_fn(annot, np.ones((5, 7)))
(nan, nan, nan, nan)
>>> compute_tp_tn_fp_fn(np.zeros((5, 7)), np.zeros((5, 7)))
(35, 0, 0, 0)

imsegm.classification.convert_dict_label_features_2_vectors(dict_features)

convert dictionary of features where key is the labels to vector of all features and related labels

Parameters: dict_features (dict(int,list(list(float)))) – {int: [list(float) * nb_features] * nb_samples}
Return tuple(ndarray,list(int)): np.array<nb_samples, nb_features>, list(int)

imsegm.classification.convert_set_features_labels_2_dataset(imgs_features, imgs_labels, drop_labels=None, balance_type=None)

given a dictionary per image, concatenate all features and labels over the images into a simple form

Parameters

imgs_features (dict(str,ndarray)) – dictionary of name and features
imgs_labels (dict(str,ndarray)) – dictionary of name and labels
drop_labels (list(int)) – labels to be ignored
balance_type (str|None) – type of balancing the number of samples per class

Return tuple(ndarray,ndarray,ndarray)

>>> np.random.seed(0)
>>> d_fts = {'a': np.random.random((25, 3)),
...          'b': np.random.random((30, 3)), }
>>> d_lbs = {'a': np.random.randint(0, 2, 25),
...          'b': np.random.randint(0, 2, 30)}
>>> fts, lbs, sizes = convert_set_features_labels_2_dataset(d_fts, d_lbs)
>>> fts.shape
(55, 3)
>>> lbs.shape
(55,)
>>> sizes
[25, 30]

imsegm.classification.create_classif_search(name_clf, clf_pipeline, nb_labels, search_type='random', cross_val=10, eval_metric='f1', nb_iter=250, nb_workers=5)

create an sklearn hyper-parameter search, either random or grid, depending on the specification

Parameters

nb_labels (int) – number of labels
search_type (str) – hyper-parameter search type
eval_metric (str) – evaluation metric
nb_iter (int) – number of tries for the random search
name_clf (str) – name of the classifier
clf_pipeline (obj) – pipeline object
cross_val (obj) – specific CV object for a fixed train-test split
nb_workers (int) – number of jobs running in parallel

Returns

imsegm.classification.create_classif_search_train_export(clf_name, features, labels, cross_val=10, nb_search_iter=100, search_type='random', eval_metric='f1', nb_workers=1, path_out=None, params=None, pca_coef=0.98, feature_names=None, label_names=None)

create a classifier and either train it once or search for the best parameters. If the output path is given, export the classifier for later use

Parameters

clf_name (str) – name of selected classifier
features (ndarray) – features in dimension nb_samples x nb_features
labels (list(int)) – annotation for samples
cross_val (int|obj) – Cross validation
search_type (str) – search type
eval_metric (str) – evaluation metric
params (dict) – extra parameters
pca_coef (float) – sklearn PCA - int/float/None
nb_search_iter (int) – number of search iterations for hyper-parameters
path_out (str) – path to directory for exporting classifier
nb_workers (int) – parallel processes
feature_names (list(str)) – list of extracted features - names
label_names (list(str)) – list of label names

Returns

(obj, str): classifier, path to the exported classifier

>>> np.random.seed(0)
>>> lbs = np.random.randint(0, 3, 150)
>>> fts = np.random.random((150, 5)) + np.tile(lbs, (5, 1)).T
>>> _, _ = create_classif_search_train_export('LogistRegr', fts, lbs, nb_search_iter=0)
>>> clf, p_clf = create_classif_search_train_export('AdaBoost', fts, lbs,
...     nb_search_iter=2, path_out='', search_type='grid')  
Fitting ...
>>> clf  
Pipeline(...)
>>> clf, p_clf = create_classif_search_train_export('RandForest', fts, lbs,
...     nb_search_iter=2, path_out='.', search_type='random')  
Fitting ...
>>> clf  
Pipeline(...)
>>> os.path.basename(p_clf)
'classifier_RandForest.pkl'
>>> os.remove(p_clf)
>>> import glob
>>> files = glob.glob(os.path.join('.', 'classif_*.txt'))
>>> sorted([os.path.basename(fp) for fp in files])  
['classif_RandForest_search_params_best.txt',
 'classif_RandForest_search_params_scores.txt']
>>> for p in files: os.remove(p)

imsegm.classification.create_classifiers(nb_workers=-1)

create all classifiers with default parameters

Parameters: nb_workers (int) – number of parallel jobs, if possible
Return dict(str,clf)

>>> classifs = create_classifiers()
>>> classifs  
{...}
>>> sum([isinstance(create_clf_param_search_grid(k), dict) for k in classifs.keys()])
7
>>> sum([isinstance(create_clf_param_search_distrib(k), dict) for k in classifs.keys()])
7

imsegm.classification.create_clf_param_search_distrib(name_classif='RandForest')

create parameter distribution for random search

Parameters: name_classif (str) – key name of classifier
Returns: dict

>>> create_clf_param_search_distrib()  
{...}
>>> dict_classif = create_classifiers()
>>> all(len(create_clf_param_search_distrib(k)) > 0 for k in dict_classif)
True
>>> create_clf_param_search_distrib('none')
{}

imsegm.classification.create_clf_param_search_grid(name_classif='RandForest')

create parameter grid for search

Parameters: name_classif (str) – key name of selected classifier
Returns: dict

>>> create_clf_param_search_grid('RandForest') 
{'classif__...': ...}
>>> dict_classif = create_classifiers()
>>> all(len(create_clf_param_search_grid(k)) > 0 for k in dict_classif)
True
>>> create_clf_param_search_grid('none')
{}

imsegm.classification.create_clf_pipeline(name_classif='RandForest', pca_coef=0.95)

create complete pipeline with all required steps

Parameters

pca_coef (int|float|None) – sklearn PCA
name_classif (str) – key name of classif.

Returns

object

>>> create_clf_pipeline()  
Pipeline(...)

imsegm.classification.create_pipeline_neuron_net()

create a classifier for a simple neural network

Returns: clf

>>> create_pipeline_neuron_net()  
Pipeline(...)

imsegm.classification.down_sample_dict_features_kmean(dict_features, nb_samples)

cluster the features with k-means, with the number of clusters equal to the given nb_samples, and return the features which are closest to each cluster center

Parameters

dict_features (dict) – {int: [list(float) * nb_features] * nb}
nb_samples (int) –

Return dict

{int: [list(float) * nb_features] * nb_samples}

>>> np.random.seed(0)
>>> d_fts = {'a': np.random.random((100, 3))}
>>> d_fts = down_sample_dict_features_kmean(d_fts, 5)
>>> d_fts['a'].shape
(5, 3)

imsegm.classification.down_sample_dict_features_random(dict_features, nb_samples)

browse all label features and take random subset of features to have given nb_samples per class

Parameters

dict_features (dict) – {int: [list(float) * nb_features] * nb}
nb_samples (int) –

Return dict

{int: [list(float) * nb_features] * nb_samples}

>>> np.random.seed(0)
>>> d_fts = {'a': np.random.random((100, 3))}
>>> d_fts = down_sample_dict_features_random(d_fts, 5)
>>> d_fts['a'].shape
(5, 3)

imsegm.classification.down_sample_dict_features_unique(dict_features)

browse all label features and take unique features

Parameters: dict_features (dict) – {int: [list(float) * nb_features] * nb_samples}
Return dict: {int: [list(float) * nb_features] * nb}

>>> np.random.seed(0)
>>> d_fts = {'a': np.random.random((100, 3))}
>>> d_fts = down_sample_dict_features_unique(d_fts)
>>> d_fts['a'].shape
(100, 3)

imsegm.classification.eval_classif_cross_val_roc(clf_name, classif, features, labels, cross_val, path_out=None, nb_steps=100)

compute mean ROC curve on cross-validation schema

http://scikit-learn.org/0.15/auto_examples/plot_roc_crossval.html

Parameters

clf_name (str) – name of selected classifier
classif (obj) – sklearn classifier
features (ndarray) – features in dimension nb_samples x nb_features
labels (list(int)) – annotation for samples
cross_val (object) –
path_out (str) – path for exporting statistic
nb_steps (int) – number of thresholds

Returns

>>> np.random.seed(0)
>>> labels = np.array([0] * 150 + [1] * 100 + [3] * 50)
>>> data = np.tile(labels, (6, 1)).T.astype(float)
>>> data += np.random.random(data.shape)
>>> data.shape
(300, 6)
>>> from sklearn.model_selection import StratifiedKFold
>>> cv = StratifiedKFold(n_splits=5, random_state=0, shuffle=True)
>>> classif = create_classifiers()[DEFAULT_CLASSIF_NAME]
>>> fp_tp, auc = eval_classif_cross_val_roc(DEFAULT_CLASSIF_NAME, classif, data, labels, cv, nb_steps=11)
>>> fp_tp
     FP   TP
0   0.0  0.0
1   0.1  1.0
2   0.2  1.0
3   0.3  1.0
4   0.4  1.0
5   0.5  1.0
6   0.6  1.0
7   0.7  1.0
8   0.8  1.0
9   0.9  1.0
10  1.0  1.0
>>> auc  
0.94...
>>> labels[-50:] -= 1
>>> data[-50:, :] -= 1
>>> path_out = 'temp_eval-cv-roc'
>>> os.mkdir(path_out)
>>> fp_tp, auc = eval_classif_cross_val_roc(
...     DEFAULT_CLASSIF_NAME, classif, data, labels, cv, nb_steps=5, path_out=path_out)
>>> fp_tp
     FP   TP
0  0.00  0.0
1  0.25  1.0
2  0.50  1.0
3  0.75  1.0
4  1.00  1.0
>>> auc
0.875
>>> import shutil
>>> shutil.rmtree(path_out, ignore_errors=True)

imsegm.classification.eval_classif_cross_val_scores(clf_name, classif, features, labels, cross_val=10, path_out=None, scorings=('f1_macro', 'accuracy', 'precision_macro', 'recall_macro'))

compute statistics on a cross-validation scheme

http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html

Parameters

clf_name (str) – name of selected classifier
classif (obj) – sklearn classifier
features (ndarray) – features in dimension nb_samples x nb_features
labels (list(int)) – annotation for samples
cross_val (object) –
path_out (str) – path for exporting statistic
scorings (list(str)) – list of used scorings

Return DF

>>> labels = np.array([0] * 150 + [1] * 100 + [2] * 50)
>>> data = np.tile(labels, (6, 1)).T.astype(float)
>>> data += 0.5 - np.random.random(data.shape)
>>> data.shape
(300, 6)
>>> from sklearn.model_selection import StratifiedKFold
>>> cv = StratifiedKFold(n_splits=5, random_state=0, shuffle=True)
>>> classif = create_classifiers()[DEFAULT_CLASSIF_NAME]
>>> df = eval_classif_cross_val_scores(DEFAULT_CLASSIF_NAME, classif, data, labels, cv)
>>> df.round(decimals=1)
   f1_macro  accuracy  precision_macro  recall_macro
0       1.0       1.0              1.0           1.0
1       1.0       1.0              1.0           1.0
2       1.0       1.0              1.0           1.0
3       1.0       1.0              1.0           1.0
4       1.0       1.0              1.0           1.0
>>> labels[labels == 1] = 2
>>> cv = StratifiedKFold(n_splits=3, random_state=0, shuffle=True)
>>> df = eval_classif_cross_val_scores(DEFAULT_CLASSIF_NAME, classif, data, labels, cv, path_out='.')
>>> df.round(decimals=1)
   f1_macro  accuracy  precision_macro  recall_macro
0       1.0       1.0              1.0           1.0
1       1.0       1.0              1.0           1.0
2       1.0       1.0              1.0           1.0
>>> import glob
>>> p_files = glob.glob(NAME_CSV_CLASSIF_CV_SCORES.replace('{}', '*'))
>>> sorted(p_files)  
['classif_RandForest_cross-val_scores-all-folds.csv',
 'classif_RandForest_cross-val_scores-statistic.csv']
>>> [os.remove(p) for p in p_files]  
[...]

imsegm.classification.export_results_clf_search(path_out, clf_name, clf_search)

do the final testing and save all results

Parameters

path_out (str) – path to directory for exporting classifier
clf_name (str) – name of selected classifier
clf_search (object) –

imsegm.classification.feature_scoring_selection(features, labels, names=None, path_out='')

find the best features and return their indexes

http://scikit-learn.org/stable/auto_examples/linear_model/plot_sparse_recovery.html

http://scikit-learn.org/stable/auto_examples/feature_selection/plot_feature_selection.html

Parameters

features (ndarray) – np.array<nb_samples, nb_features>
labels (ndarray) – np.array<nb_samples, 1>
names (list(str)) –
path_out (str) –

Return tuple(list(int),DF)

indices, Dataframe with scoring

>>> from sklearn.datasets import make_classification
>>> features, labels = make_classification(
...     n_samples=250, n_features=5, n_informative=3, n_redundant=0, n_repeated=0,
...     n_classes=2, random_state=0, shuffle=False)
>>> indices, df_scoring = feature_scoring_selection(features, labels)  
>>> indices
array([1, 0, 2, 3, 4]...)
>>> df_scoring.sort_index(axis=1)  
         ExtTree    F-test    k-Best variance
feature
1        0.24...   0.75...   0.75...  2.49...
2        0.33...  58.94...  58.94...  1.85...
3        0.22...   2.24...   2.24...  1.54...
4        0.10...   4.02...   4.02...  0.96...
5        0.09...   0.02...   0.02...  1.01...
>>> features[:, 2] = 1
>>> path_out = 'test_fts-select'
>>> os.mkdir(path_out)
>>> indices, df_scoring = feature_scoring_selection(features.tolist(), labels.tolist(), path_out=path_out)
>>> indices  
array([1, 0, 3, 4, 2]...)
>>> import shutil
>>> shutil.rmtree(path_out, ignore_errors=True)

imsegm.classification.load_classifier(path_classif)

load a previously exported classifier

Parameters: path_classif (str) – path to the exported classifier
Return dict

>>> load_classifier('none.abc')

imsegm.classification.relabel_sequential(labels, uq_labels=None)

relabel a vector sequentially, starting from 0

Parameters

labels (list(int)) – all labels
uq_labels (list(int)) – unique labels

Return list(int)

>>> relabel_sequential([0, 0, 0, 5, 5, 5, 0, 5])
[0, 0, 0, 1, 1, 1, 0, 1]
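
Sequential relabelling is a lookup from each unique label to its rank. The sketch below is a plain-Python illustration with a hypothetical name; it assumes that missing `uq_labels` default to the sorted unique values of `labels`.

```python
def relabel_sequential_sketch(labels, uq_labels=None):
    """Map every label to its position among the unique labels."""
    if uq_labels is None:
        uq_labels = sorted(set(labels))
    lut = {label: new_label for new_label, label in enumerate(uq_labels)}
    return [lut[label] for label in labels]


relabelled = relabel_sequential_sketch([0, 0, 0, 5, 5, 5, 0, 5])
print(relabelled)  # [0, 0, 0, 1, 1, 1, 0, 1]
```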

imsegm.classification.save_classifier(path_out, classif, clf_name, params, feature_names=None, label_names=None)

export the given classifier together with its parameters and names

Parameters

path_out (str) – path for exporting the trained classifier
classif – sklearn classif.
clf_name (str) – name of selected classifier
feature_names (list(str)) – list of string names
params (dict) – extra parameters
label_names (list(str)) – list of string names of label_names

Return str

>>> clf = create_classifiers()['RandForest']
>>> p_clf = save_classifier('.', clf, 'TESTINNG', {})
>>> os.path.basename(p_clf)
'classifier_TESTINNG.pkl'
>>> d_clf = load_classifier(p_clf)
>>> sorted(d_clf.keys())
['clf_pipeline', 'features', 'label_names', 'name', 'params']
>>> d_clf['clf_pipeline']  
RandomForestClassifier(...)
>>> d_clf['name']
'TESTINNG'
>>> os.remove(p_clf)
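
The save/load round trip shown above can be imitated with the standard library. This sketch assumes pickle serialization, which the `.pkl` suffix suggests but this page does not confirm; the `*_sketch` functions are hypothetical, and only the dictionary keys and the file-name template are taken from the examples.

```python
import os
import pickle
import tempfile

TEMPLATE_NAME_CLF = 'classifier_{}.pkl'  # template value listed on this page


def save_classifier_sketch(path_out, classif, clf_name, params,
                           feature_names=None, label_names=None):
    """Store the classifier and its metadata in one dictionary on disk."""
    record = {
        'clf_pipeline': classif,
        'name': clf_name,
        'params': params,
        'features': feature_names,
        'label_names': label_names,
    }
    path_clf = os.path.join(path_out, TEMPLATE_NAME_CLF.format(clf_name))
    with open(path_clf, 'wb') as fp:
        pickle.dump(record, fp)
    return path_clf


def load_classifier_sketch(path_classif):
    """Return the stored dictionary, or None when the file does not exist."""
    if not os.path.isfile(path_classif):
        return None
    with open(path_classif, 'rb') as fp:
        return pickle.load(fp)


with tempfile.TemporaryDirectory() as tmp_dir:
    # any picklable object can stand in for a trained classifier here
    path_clf = save_classifier_sketch(tmp_dir, {'stand-in': 'model'}, 'Demo', {})
    restored = load_classifier_sketch(path_clf)
    missing = load_classifier_sketch(os.path.join(tmp_dir, 'none.abc'))
print(os.path.basename(path_clf), sorted(restored), missing)
```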

imsegm.classification.search_params_cut_down_max_nb_iter(clf_parameters, nb_iter)

create the list of parameters and count the number of possible combinations, in case they are limited

Parameters

clf_parameters (dict) – dictionary with parameters
nb_iter (int) – number of random tries

Return int

>>> clf_params = create_clf_param_search_grid(DEFAULT_CLASSIF_NAME)
>>> search_params_cut_down_max_nb_iter(clf_params, 100)
100
>>> search_params_cut_down_max_nb_iter(clf_params, 1e6)
1450
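
The cut-down can be understood as capping the requested number of tries by the size of the parameter grid. A minimal sketch, assuming every parameter maps to a finite list of values; the function name and the example grid are hypothetical.

```python
import math


def cut_down_nb_iter(clf_parameters, nb_iter):
    """Never ask for more tries than there are parameter combinations."""
    nb_combinations = math.prod(len(values) for values in clf_parameters.values())
    return int(min(nb_iter, nb_combinations))


# hypothetical grid with 3 * 2 = 6 combinations
grid = {'classif__max_depth': [2, 4, 8], 'classif__criterion': ['gini', 'entropy']}
print(cut_down_nb_iter(grid, 4))    # 4
print(cut_down_nb_iter(grid, 1e6))  # 6
```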

imsegm.classification.shuffle_features_labels(features, labels)

take the set of features and labels and shuffle them together while keeping link between feature and its label

Parameters

features (ndarray) – features in dimension nb_samples x nb_features
labels (list(int)) – annotation for samples

Returns

np.array<nb_samples, nb_features>, np.array<nb_samples>

>>> np.random.seed(0)
>>> fts = np.random.random((5, 2))
>>> lbs = np.random.randint(0, 2, 5)
>>> fts_new, lbs_new = shuffle_features_labels(fts, lbs)
>>> np.array_equal(fts, fts_new)
False
>>> np.array_equal(lbs, lbs_new)
False
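
Keeping the link between a feature row and its label comes down to applying one permutation to both sequences. A small sketch with plain lists; `shuffle_together` and its `seed` argument are hypothetical.

```python
import random


def shuffle_together(features, labels, seed=0):
    """Apply the same random order to the features and to the labels."""
    order = list(range(len(labels)))
    random.Random(seed).shuffle(order)
    return [features[i] for i in order], [labels[i] for i in order]


features = [[0.0], [1.0], [2.0], [3.0]]
labels = [0, 1, 2, 3]
fts_new, lbs_new = shuffle_together(features, labels)
# each row still carries its own label after shuffling
print(all(row[0] == label for row, label in zip(fts_new, lbs_new)))  # True
```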

imsegm.classification.unique_rows(data)

detect the unique rows of a matrix and return only them

Parameters: data (ndarray) – np.array
Return ndarray: np.array
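
As an illustration of the idea, duplicate rows can be dropped by remembering which rows were already seen. This sketch uses nested lists and keeps the first occurrence in its original position; the order returned by the library function may differ, and the name `unique_rows_sketch` is hypothetical.

```python
def unique_rows_sketch(data):
    """Keep the first occurrence of every distinct row."""
    seen = set()
    unique = []
    for row in data:
        key = tuple(row)  # rows must be hashable to be remembered
        if key not in seen:
            seen.add(key)
            unique.append(list(row))
    return unique


rows = unique_rows_sketch([[1, 2], [3, 4], [1, 2]])
print(rows)  # [[1, 2], [3, 4]]
```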

imsegm.classification.DEFAULT_CLASSIF_NAME = 'RandForest'

default (recommended) classifier for supervised segmentation

imsegm.classification.DEFAULT_CLUSTERING = 'kMeans'

default (recommended) clustering for unsupervised segmentation

imsegm.classification.DICT_SCORING = {'accuracy': sklearn.metrics.accuracy_score, 'f1': sklearn.metrics.f1_score, 'precision': sklearn.metrics.precision_score, 'recall': sklearn.metrics.recall_score}

mapping of metric names to the functions used

imsegm.classification.METRIC_AVERAGES = ('macro', 'weighted')

default types of computed metrics

imsegm.classification.METRIC_SCORING = ('f1_macro', 'accuracy', 'precision_macro', 'recall_macro')

default computed metrics

imsegm.classification.NAME_CSV_CLASSIF_CV_ROC = 'classif_{}_cross-val_ROC-{}.csv'

exporting partial results about the trained classifier - Receiver Operating Characteristics

imsegm.classification.NAME_CSV_CLASSIF_CV_SCORES = 'classif_{}_cross-val_scores-{}.csv'

exporting partial results about the trained classifier

imsegm.classification.NAME_CSV_FEATURES_SELECT = 'feature_selection.csv'

file name of the exported evaluation of feature quality

imsegm.classification.NAME_TXT_CLASSIF_CV_AUC = 'classif_{}_cross-val_AUC-{}.txt'

exporting partial results about the trained classifier - Area Under Curve

imsegm.classification.NB_WORKERS_SERACH = 1

default number of workers

imsegm.classification.ROUND_UNIQUE_FTS_DIGITS = 3

number of digits for rounding unique features, in case the precision is too detailed

imsegm.classification.TEMPLATE_NAME_CLF = 'classifier_{}.pkl'

name template for exporting the trained classifier (adding classifier name and version)
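
The `{}` placeholders in the name templates are filled with `str.format`, and replacing them with `*` gives a glob pattern, as the eval_classif_cross_val_scores example does. The snippet copies two template values from this page so that it runs without the package.

```python
# template values copied from the constants listed on this page
TEMPLATE_NAME_CLF = 'classifier_{}.pkl'
NAME_CSV_CLASSIF_CV_SCORES = 'classif_{}_cross-val_scores-{}.csv'

name_clf = TEMPLATE_NAME_CLF.format('RandForest')
name_scores = NAME_CSV_CLASSIF_CV_SCORES.format('RandForest', 'all-folds')
pattern = NAME_CSV_CLASSIF_CV_SCORES.replace('{}', '*')

print(name_clf)     # classifier_RandForest.pkl
print(name_scores)  # classif_RandForest_cross-val_scores-all-folds.csv
print(pattern)      # classif_*_cross-val_scores-*.csv
```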