imsegm.classification module

Supporting module for creating scikit-learn classifiers and setting their parameters, together with some preprocessing functions that support classification

Copyright (C) 2014-2018 Jiri Borovec <jiri.borovec@fel.cvut.cz>

class imsegm.classification.CrossValidate(nb_samples, nb_hold_out, rand_seed=None, ignore_overflow=0.01)[source]

Bases: object

Cross-validator generator. The data is split repeatedly into a train set and a test set, holding out a different block of samples in each fold.

Parameters
  • nb_samples (integer, total number of samples) –

  • nb_hold_out (integer, number of samples held out) –

  • rand_seed (seed for the random number generator) –

  • ignore_overflow (float, tolerance while dividing dataset to folds) –

Examples

>>> # balanced split
>>> cv = CrossValidate(6, 3, rand_seed=False)
>>> cv.indexes
[0, 1, 2, 3, 4, 5]
>>> len(cv)
2
>>> list(cv)  
[([3, 4, 5], [0, 1, 2]),
 ([0, 1, 2], [3, 4, 5])]
>>> [(len(tr), len(ts)) for tr, ts in CrossValidate(340, 0.41)]
[(201, 139), (201, 139), (201, 139)]
>>> # not rounded split
>>> cv = CrossValidate(7, 3, rand_seed=0)
>>> list(cv)  
[([3, 0, 5, 4], [6, 2, 1]),
 ([6, 2, 1, 4], [3, 0, 5]),
 ([1, 3, 0, 5], [4, 6, 2])]
>>> len(cv)
3
>>> cv.indexes
[6, 2, 1, 3, 0, 5, 4]
>>> # larger test than train
>>> cv = CrossValidate(7, 5, rand_seed=0)
>>> list(cv)  
[([6, 2], [1, 3, 0, 5, 4]),
 ([1, 3], [6, 2, 0, 5, 4]),
 ([0, 5], [6, 2, 1, 3, 4]),
 ([4, 6], [2, 1, 3, 0, 5])]
>>> [(len(tr), len(ts)) for tr, ts in CrossValidate(340, 0.55)]
[(153, 187), (153, 187), (153, 187)]
>>> # impact of tolerance
>>> len(CrossValidate(340, 0.33, ignore_overflow=0.0))
4
>>> len(CrossValidate(340, 0.33, ignore_overflow=0.05))
3
>>> [(len(tr), len(ts)) for tr, ts in CrossValidate(4651, 0.25, ignore_overflow=0.)]
[(3488, 1163), (3488, 1163), (3488, 1163), (3488, 1163)]
>>> [(len(tr), len(ts)) for tr, ts in CrossValidate(4651, 0.25, ignore_overflow=1e-2)]
[(3488, 1163), (3488, 1163), (3488, 1163), (3489, 1162)]
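
The folds listed in the examples above are consistent with a hold-out window that slides over the index list and wraps around at its end. The following is a minimal sketch of that idea and not the library's implementation: `sliding_hold_out` is a hypothetical helper, and it ignores shuffling, fractional `nb_hold_out`, the `ignore_overflow` tolerance, and the case where the test set is larger than the train set.

```python
def sliding_hold_out(indexes, nb_hold_out):
    """Yield (train, test) pairs; the test window wraps around the end."""
    nb = len(indexes)
    for start in range(0, nb, nb_hold_out):
        # take nb_hold_out consecutive positions, wrapping past the end
        test = [indexes[(start + i) % nb] for i in range(nb_hold_out)]
        taken = set(test)
        # everything not held out stays in the train set, in original order
        train = [i for i in indexes if i not in taken]
        yield train, test


folds = list(sliding_hold_out(list(range(6)), 3))
print(folds)
```

Feeding the shuffled index list `[6, 2, 1, 3, 0, 5, 4]` with a window of 3 to this sketch yields the same three folds as the "not rounded split" example.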

constructor

Parameters
  • nb_samples (int) – total number of samples

  • nb_hold_out (int|float) – number (int) or fraction (float) of samples to hold out

  • rand_seed (int|None) – random seed for shuffling

  • ignore_overflow (float) – tolerance while dividing dataset to folds

_CrossValidate__steps()[source]

adjust this iterator, tol_balance

Return list(int)

indexes of steps

class imsegm.classification.CrossValidateGroups(set_sizes, nb_hold_out, rand_seed=None, ignore_overflow=0.01)[source]

Bases: imsegm.classification.CrossValidate

Cross-validator generator over sets (groups) of samples. Whole sets are held out together, so the samples of one set never appear in both the train set and the test set of a fold.

Parameters
  • set_sizes (list of integers, number of samples in each set) –

  • nb_hold_out (integer, number of sets held out) –

  • rand_seed (seed for the random number generator) –

  • ignore_overflow (float, tolerance while dividing dataset to folds) –

Examples

>>> # balance split
>>> cv = CrossValidateGroups([2, 3, 2, 3], 2, rand_seed=False)
>>> cv.set_indexes
[[0, 1], [2, 3, 4], [5, 6], [7, 8, 9]]
>>> len(cv)
2
>>> list(cv)  
[([5, 6, 7, 8, 9], [0, 1, 2, 3, 4]),
 ([0, 1, 2, 3, 4], [5, 6, 7, 8, 9])]
>>> [(len(tr), len(ts)) for tr, ts in CrossValidateGroups([7] * 340, 0.41)]
[(1407, 973), (1407, 973), (1407, 973)]
>>> # unbalanced split
>>> cv = CrossValidateGroups([2, 2, 1, 2, 1], 2, rand_seed=0)
>>> cv.set_indexes
[[0, 1], [2, 3], [4], [5, 6], [7]]
>>> list(cv)  
[([2, 3, 5, 6, 7], [4, 0, 1]),
 ([4, 0, 1, 7], [2, 3, 5, 6]),
 ([0, 1, 2, 3, 5, 6], [7, 4])]
>>> len(cv)
3
>>> cv.indexes
[2, 0, 1, 3, 4]
>>> # larger test than train
>>> cv = CrossValidateGroups([2, 2, 1, 2, 1, 1], 4, rand_seed=0)
>>> list(cv)  
[([8, 4], [2, 3, 5, 6, 0, 1, 7]),
 ([2, 3, 5, 6], [8, 4, 0, 1, 7]),
 ([0, 1, 7], [8, 4, 2, 3, 5, 6])]
>>> [(len(tr), len(ts)) for tr, ts in CrossValidateGroups([7] * 340, 0.55)]
[(1071, 1309), (1071, 1309), (1071, 1309)]

constructor

Parameters
  • set_sizes (list(int)) – list of sizes

  • nb_hold_out (int|float) – number (int) or fraction (float) of sets to hold out

  • rand_seed (int|None) – random seed for shuffling

  • ignore_overflow (float) – tolerance while dividing dataset to folds

_CrossValidateGroups__iter_indexes(sets)[source]

return the unrolled sample indexes of the selected sets

Parameters

sets (list(int)) – selection of indexes

Return list(int)
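
The relation between `set_sizes`, `set_indexes` and the unrolled sample indexes can be sketched as follows. This is an illustration under the assumption that sets receive consecutive sample indexes in order, which matches the `set_indexes` values printed in the examples; `build_set_indexes` and `unroll_sets` are hypothetical names, not part of the module.

```python
from itertools import accumulate


def build_set_indexes(set_sizes):
    """Assign consecutive sample indexes to each set, in order."""
    ends = list(accumulate(set_sizes))
    starts = [0] + ends[:-1]
    return [list(range(b, e)) for b, e in zip(starts, ends)]


def unroll_sets(set_indexes, sets):
    """Concatenate the sample indexes of the selected sets."""
    return [idx for s in sets for idx in set_indexes[s]]


set_indexes = build_set_indexes([2, 2, 1, 2, 1])
selected = unroll_sets(set_indexes, [2, 0])
print(set_indexes)
print(selected)
```

Selecting sets 2 and 0 gives the samples `[4, 0, 1]`, which is the test part of the first fold in the unbalanced example.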

class imsegm.classification.HoldOut(nb_samples, hold_out, rand_seed=0)[source]

Bases: object

Hold-out cross-validator generator. In the hold-out, the data is split only once into a train set and a test set. Unlike in other cross-validation schemes, the hold-out consists of only one iteration.

Parameters
  • nb_samples (int, total number of samples) –

  • hold_out (int, number where the test starts) –

  • rand_seed (seed for the random number generator) –

Example

>>> ho = HoldOut(10, 7, rand_seed=None)
>>> len(ho)
1
>>> list(ho)
[([0, 1, 2, 3, 4, 5, 6], [7, 8, 9])]
>>> ho = HoldOut(10, 7, rand_seed=0)
>>> list(ho)
[([2, 8, 4, 9, 1, 6, 7], [3, 0, 5])]

constructor

Parameters
  • nb_samples (int) – total number of samples

  • hold_out (int) – index where the test starts

  • rand_seed (obj) – Seed for the random number generator.
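
A single hold-out split can be sketched in a few lines. `hold_out_split` is a hypothetical stand-in for illustration only; it shuffles with the standard-library `random` module, so for a given seed the shuffled order will generally differ from the one printed in the example above.

```python
import random


def hold_out_split(nb_samples, hold_out, rand_seed=None):
    """Return (train, test), where the test part starts at index `hold_out`."""
    indexes = list(range(nb_samples))
    if rand_seed is not None:
        # assumption: a seeded shuffle; the library's generator may differ
        random.Random(rand_seed).shuffle(indexes)
    return indexes[:hold_out], indexes[hold_out:]


train, test = hold_out_split(10, 7)
print(train, test)
```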

imsegm.classification.balance_dataset_by_(features, labels, balance_type='random', min_samples=None)[source]

balance the number of training examples per class using one of several methods

Parameters
  • features (ndarray) – features in dimension nb_samples x nb_features

  • labels (list(int)) – annotation for samples

  • balance_type (str) – type of balancing dataset

  • min_samples (int|None) – if None take the smallest class

Return tuple(ndarray,ndarray)

>>> np.random.seed(0)
>>> fts, lbs = balance_dataset_by_(np.random.random((25, 3)),
...                                np.random.randint(0, 2, 25))
>>> fts.shape
(24, 3)
>>> lbs
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
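
Random balancing can be pictured as sub-sampling every class down to the size of the smallest one. The sketch below uses plain lists and a hypothetical `balance_random` helper; it only illustrates the idea and does not reproduce the library's sampling order or its other balancing types.

```python
import random
from collections import defaultdict


def balance_random(features, labels, min_samples=None, seed=0):
    """Randomly keep the same number of samples for every class."""
    by_label = defaultdict(list)
    for row, lb in zip(features, labels):
        by_label[lb].append(row)
    if min_samples is None:
        # default: size of the smallest class
        min_samples = min(len(rows) for rows in by_label.values())
    rng = random.Random(seed)
    out_features, out_labels = [], []
    for lb in sorted(by_label):
        rows = by_label[lb]
        picked = rng.sample(rows, min(min_samples, len(rows)))
        out_features.extend(picked)
        out_labels.extend([lb] * len(picked))
    return out_features, out_labels


features = [[float(i), float(i) / 2] for i in range(10)]
labels = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
fts, lbs = balance_random(features, labels)
print(len(fts), lbs)
```
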
imsegm.classification.compose_dict_label_features(features, labels)[source]

convert a vector of features and related labels to a dictionary of features where the keys are the labels

Parameters
  • features (ndarray) – features in dimension nb_samples x nb_features

  • labels (list(int)) – annotation for samples

Return dict(int, ndarray)

{int: np.array<nb, nb_features>}
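
The grouping of samples by label can be sketched with plain Python containers. `group_features_by_label` is a hypothetical helper that returns lists instead of arrays.

```python
from collections import defaultdict


def group_features_by_label(features, labels):
    """Collect the feature vectors of each label under its own key."""
    grouped = defaultdict(list)
    for row, lb in zip(features, labels):
        grouped[lb].append(row)
    return dict(grouped)


grouped = group_features_by_label(
    [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]], [1, 0, 1])
print(grouped)
```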

imsegm.classification.compute_classif_metrics(y_true, y_pred, metric_averages=('macro', 'weighted'))[source]

compute standard metrics for multi-class classification

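Parameters
  • y_true – true (annotated) labels

  • y_pred – predicted labels

  • metric_averages – averaging modes used for the computed metrics

Return dict(float)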

>>> np.random.seed(0)
>>> y_true = np.random.randint(0, 3, 25) * 2
>>> y_pred = np.random.randint(0, 2, 25) * 2
>>> d = compute_classif_metrics(y_true, y_true)
>>> d['accuracy']
1.0
>>> d['confusion']
[[10, 0, 0], [0, 10, 0], [0, 0, 5]]
>>> d = compute_classif_metrics(y_true, y_pred)
>>> d['accuracy']  
0.32...
>>> d['confusion']
[[3, 7, 0], [5, 5, 0], [1, 4, 0]]
>>> d = compute_classif_metrics(y_pred, y_pred)
>>> d['accuracy']
1.0
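
How the `confusion` and `accuracy` entries relate to each other can be sketched without any dependencies. The helper below is hypothetical and follows the usual scikit-learn layout, with rows for true labels and columns for predicted labels; it is not the function documented above.

```python
def confusion_and_accuracy(y_true, y_pred):
    """Build a confusion matrix (rows: true, columns: predicted) and accuracy."""
    classes = sorted(set(y_true) | set(y_pred))
    position = {c: i for i, c in enumerate(classes)}
    confusion = [[0] * len(classes) for _ in classes]
    for t, p in zip(y_true, y_pred):
        confusion[position[t]][position[p]] += 1
    # correct predictions lie on the diagonal
    correct = sum(confusion[i][i] for i in range(len(classes)))
    return confusion, correct / len(y_true)


confusion, accuracy = confusion_and_accuracy([0, 0, 2, 2, 4], [0, 2, 2, 2, 0])
print(confusion, accuracy)
```
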
imsegm.classification.compute_classif_stat_segm_annot(annot_segm_name, drop_labels=None, relabel=False)[source]

compute classification statistic between annotation and segmentation

Parameters
  • annot_segm_name (tuple(ndarray,ndarray,str)) –

  • drop_labels (list(int)) – labels to be ignored

  • relabel (bool) – whether to relabel

Returns

>>> np.random.seed(0)
>>> annot = np.random.randint(0, 2, (5, 10))
>>> segm = np.random.randint(0, 2, (5, 10))
>>> d = compute_classif_stat_segm_annot((annot, annot, 'ttt'), relabel=True,
...                                     drop_labels=[5])
>>> d['(FP+FN)/(TP+FN)']  
0.0
>>> d['(TP+FP)/(TP+FN)']  
1.0
>>> d = compute_classif_stat_segm_annot((annot, segm, 'ttt'), relabel=True,
...                                     drop_labels=[5])
>>> d['(FP+FN)/(TP+FN)']  
0.846...
>>> d['(TP+FP)/(TP+FN)']  
1.153...
>>> d = compute_classif_stat_segm_annot((annot, segm + 1, 'ttt'),
...                                     relabel=False, drop_labels=[0])
>>> d['confusion']
[[13, 17], [0, 0]]
imsegm.classification.compute_metric_fpfn_tpfn(annot, segm, label_positive=None)[source]

compute measure (FP + FN) / (TP + FN)

Parameters
  • annot (ndarray) – annotation

  • segm (ndarray) – segmentation

  • label_positive (int) – indexes of positive labels

Return float

>>> np.random.seed(0)
>>> annot = np.random.randint(0, 2, (50, 75)) * 3
>>> segm = np.random.randint(0, 2, (50, 75)) * 3
>>> compute_metric_fpfn_tpfn(annot, segm)  
1.02...
>>> compute_metric_fpfn_tpfn(annot, annot)
0.0
>>> compute_metric_fpfn_tpfn(annot, np.ones((50, 75)))
nan
imsegm.classification.compute_metric_tpfp_tpfn(annot, segm, label_positive=None)[source]

compute measure (TP + FP) / (TP + FN)

Parameters
  • annot (ndarray) –

  • segm (ndarray) –

  • label_positive (int) –

Return float

>>> np.random.seed(0)
>>> annot = np.random.randint(0, 2, (50, 75)) * 3
>>> segm = np.random.randint(0, 2, (50, 75)) * 3
>>> compute_metric_tpfp_tpfn(annot, segm)  
1.03...
>>> compute_metric_tpfp_tpfn(annot, annot)
1.0
>>> compute_metric_tpfp_tpfn(annot, np.ones((50, 75)))
nan
>>> compute_metric_tpfp_tpfn(annot, np.zeros((50, 75)))
0.0
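
Both ratios can be derived from the four basic counts. The sketch below uses the textbook definitions with an explicitly chosen positive label; the helper names are hypothetical, and the library's default choice of the positive label (when `label_positive` is None) is not reproduced here.

```python
def count_tp_tn_fp_fn(annot, segm, label_positive=1):
    """Count TP, TN, FP, FN for flat label sequences."""
    tp = tn = fp = fn = 0
    for a, s in zip(annot, segm):
        a_pos, s_pos = a == label_positive, s == label_positive
        if a_pos and s_pos:
            tp += 1
        elif not a_pos and not s_pos:
            tn += 1
        elif s_pos:
            fp += 1
        else:
            fn += 1
    return tp, tn, fp, fn


def ratio_or_nan(numerator, denominator):
    """Avoid a ZeroDivisionError when there are no positive annotations."""
    return numerator / denominator if denominator else float('nan')


annot = [1, 1, 1, 0, 0, 0, 1, 0]
segm = [1, 0, 1, 1, 0, 0, 1, 1]
tp, tn, fp, fn = count_tp_tn_fp_fn(annot, segm)
fpfn_tpfn = ratio_or_nan(fp + fn, tp + fn)  # (FP + FN) / (TP + FN)
tpfp_tpfn = ratio_or_nan(tp + fp, tp + fn)  # (TP + FP) / (TP + FN)
print((tp, tn, fp, fn), fpfn_tpfn, tpfp_tpfn)
```
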
imsegm.classification.compute_stat_per_image(segms, annots, names=None, nb_workers=2, drop_labels=None, relabel=False)[source]

compute statistic over multiple segmentations with annotation

Parameters
  • segms ([ndarray]) – segmentations

  • annots ([ndarray]) – annotations

  • names (list(str)) – list of names

  • drop_labels (list(int)) – labels to be ignored

  • relabel (bool) – whether to relabel

  • nb_workers (int) – running jobs in parallel

Return DF

>>> np.random.seed(0)
>>> img_true = np.random.randint(0, 3, (50, 100))
>>> img_pred = np.random.randint(0, 2, (50, 100))
>>> df = compute_stat_per_image([img_true], [img_true], nb_workers=2, relabel=True)
>>> from pprint import pprint
>>> pprint(pd.Series(df.iloc[0]).sort_index().to_dict())  
{'ARS': 1.0,
 'accuracy': 1.0,
 'confusion': [[1672, 0, 0], [0, 1682, 0], [0, 0, 1646]],
 'f1_macro': 1.0,
 'precision_macro': 1.0,
 'recall_macro': 1.0,
 'support_macro': None}
>>> df = compute_stat_per_image([img_true], [img_pred], drop_labels=[-1])
>>> pd.Series(df.round(4).iloc[0]).sort_index()  
ARS                                                       0.0002
accuracy                                                  0.3384
confusion          [[836, 826, 770], [836, 856, 876], [0, 0, 0]]
f1_macro                                                  0.2701
precision_macro                                           0.3363
recall_macro                                              0.2257
support_macro                                               None
Name: 0, dtype: object
imsegm.classification.compute_tp_tn_fp_fn(annot, segm, label_positive=None)[source]

compute measure TruePositive, TrueNegative, FalsePositive, FalseNegative

Parameters
  • annot (ndarray) – annotation

  • segm (ndarray) – segmentation

  • label_positive (int) – indexes of positive labels

Return tuple(float,float,float,float)

>>> np.random.seed(0)
>>> annot = np.random.randint(0, 2, (5, 7)) * 9
>>> segm = np.random.randint(0, 2, (5, 7)) * 9
>>> annot - segm
array([[-9,  9,  0, -9,  9,  9,  0],
       [ 9,  0,  0,  0, -9, -9,  9],
       [-9,  0, -9, -9, -9,  0,  0],
       [ 0,  9,  0, -9,  0,  9,  0],
       [ 9, -9,  9,  0,  9,  0,  9]])
>>> compute_tp_tn_fp_fn(annot, annot)
(20, 15, 0, 0)
>>> compute_tp_tn_fp_fn(annot, segm)
(9, 5, 11, 10)
>>> compute_tp_tn_fp_fn(annot, np.ones((5, 7)))
(nan, nan, nan, nan)
>>> compute_tp_tn_fp_fn(np.zeros((5, 7)), np.zeros((5, 7)))
(35, 0, 0, 0)
imsegm.classification.convert_dict_label_features_2_vectors(dict_features)[source]

convert dictionary of features where key is the labels to vector of all features and related labels

Parameters

dict_features (dict) – {int: [list(float) * nb_features] * nb_samples}

Return tuple(ndarray,list(int))

np.array<nb_samples, nb_features>, list(int)

imsegm.classification.convert_set_features_labels_2_dataset(imgs_features, imgs_labels, drop_labels=None, balance_type=None)[source]

given a dictionary of features and a dictionary of labels per image, concatenate all features and labels over the images into a simple form

Parameters
  • imgs_features ({str: ndarray}) – dictionary of name and features

  • imgs_labels ({str: ndarray}) – dictionary of name and labels

  • drop_labels (list(int)) – labels to be ignored

  • balance_type (bool) – whether (and how) to balance the number of samples per class

Return tuple(ndarray,ndarray,ndarray)

>>> np.random.seed(0)
>>> d_fts = {'a': np.random.random((25, 3)),
...          'b': np.random.random((30, 3)), }
>>> d_lbs = {'a': np.random.randint(0, 2, 25),
...          'b': np.random.randint(0, 2, 30)}
>>> fts, lbs, sizes = convert_set_features_labels_2_dataset(d_fts, d_lbs)
>>> fts.shape
(55, 3)
>>> lbs.shape
(55,)
>>> sizes
[25, 30]
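
The concatenation step can be sketched with plain lists. `concat_image_sets` is a hypothetical helper; it assumes the images are visited in sorted name order and it skips label dropping and balancing. The returned sizes have the same shape as the `set_sizes` argument of `CrossValidateGroups`.

```python
def concat_image_sets(imgs_features, imgs_labels):
    """Stack per-image features and labels and remember each image's size."""
    features, labels, sizes = [], [], []
    for name in sorted(imgs_features):  # assumption: sorted name order
        features.extend(imgs_features[name])
        labels.extend(imgs_labels[name])
        sizes.append(len(imgs_labels[name]))
    return features, labels, sizes


d_fts = {'a': [[0.1], [0.2]], 'b': [[0.3], [0.4], [0.5]]}
d_lbs = {'a': [0, 1], 'b': [1, 1, 0]}
fts, lbs, sizes = concat_image_sets(d_fts, d_lbs)
print(len(fts), lbs, sizes)
```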

create a scikit-learn hyper-parameter search, either random or grid, depending on the specification

Parameters
  • nb_labels (int) – number of labels

  • search_type (str) – hyper-params search type

  • eval_metric (str) – evaluation metric

  • nb_iter (int) – number of tries for the random search

  • name_clf (str) – name of classif.

  • clf_pipeline (obj) – object

  • cross_val (obj) – specific cross-validation object for a fixed train-test split

  • nb_workers (int) – number jobs running in parallel

Returns

imsegm.classification.create_classif_search_train_export(clf_name, features, labels, cross_val=10, nb_search_iter=100, search_type='random', eval_metric='f1', nb_workers=1, path_out=None, params=None, pca_coef=0.98, feature_names=None, label_names=None)[source]

create a classifier and either train it once or search for the best parameters. If the output path is given, export the classifier for later use

Parameters
  • clf_name (str) – name of selected classifier

  • features (ndarray) – features in dimension nb_samples x nb_features

  • labels (list(int)) – annotation for samples

  • cross_val (int|obj) – Cross validation

  • search_type (str) – search type

  • eval_metric (str) – evaluation metric

  • params (dict) – extra parameters

  • pca_coef (float) – sklearn PCA - int/float/None

  • nb_search_iter (int) – number of search iterations for hyper-parameters

  • path_out (str) – path to directory for exporting classifier

  • nb_workers (int) – parallel processes

  • feature_names (list(str)) – list of extracted features - names

  • label_names (list(str)) – list of label names

Returns

(obj, str): classifier, path to the exported classifier

>>> np.random.seed(0)
>>> lbs = np.random.randint(0, 3, 150)
>>> fts = np.random.random((150, 5)) + np.tile(lbs, (5, 1)).T
>>> _, _ = create_classif_search_train_export('LogistRegr', fts, lbs, nb_search_iter=0)
>>> clf, p_clf = create_classif_search_train_export('AdaBoost', fts, lbs,
...     nb_search_iter=2, path_out='', search_type='grid')  
Fitting ...
>>> clf  
Pipeline(...)
>>> clf, p_clf = create_classif_search_train_export('RandForest', fts, lbs,
...     nb_search_iter=2, path_out='.', search_type='random')  
Fitting ...
>>> clf  
Pipeline(...)
>>> p_clf
'./classifier_RandForest.pkl'
>>> os.remove(p_clf)
>>> import glob
>>> files = glob.glob(os.path.join('.', 'classif_*.txt'))
>>> sorted(files)  
['./classif_RandForest_search_params_best.txt',
 './classif_RandForest_search_params_scores.txt']
>>> for p in files: os.remove(p)
imsegm.classification.create_classifiers(nb_workers=-1)[source]

create all classifiers with default parameters

Parameters

nb_workers (int) – number of parallel jobs, if possible

Return dict

{str: clf}

>>> classifs = create_classifiers()
>>> classifs  
{...}
>>> sum([isinstance(create_clf_param_search_grid(k), dict)
...      for k in classifs.keys()])
7
>>> sum([isinstance(create_clf_param_search_distrib(k), dict)
...      for k in classifs.keys()])
7
imsegm.classification.create_clf_param_search_distrib(name_classif='RandForest')[source]

create parameter distribution for random search

Parameters

name_classif (str) – key name of classifier

Returns

dict

>>> create_clf_param_search_distrib()  
{...}
>>> dict_classif = create_classifiers()
>>> all(len(create_clf_param_search_distrib(k)) > 0 for k in dict_classif)
True
>>> create_clf_param_search_distrib('none')
{}
imsegm.classification.create_clf_param_search_grid(name_classif='RandForest')[source]

create parameter grid for search

Parameters

name_classif (str) – key name of selected classifier

Returns

dict

>>> create_clf_param_search_grid('RandForest') 
{'classif__...': ...}
>>> dict_classif = create_classifiers()
>>> all(len(create_clf_param_search_grid(k)) > 0 for k in dict_classif)
True
>>> create_clf_param_search_grid('none')
{}
imsegm.classification.create_clf_pipeline(name_classif='RandForest', pca_coef=0.95)[source]

create complete pipeline with all required steps

Parameters
  • pca_coef (int|float|None) – sklearn PCA

  • name_classif (str) – key name of classif.

Returns

object

>>> create_clf_pipeline()  
Pipeline(...)
imsegm.classification.create_pipeline_neuron_net()[source]

create classifier for a simple neural network

Returns

clf

>>> create_pipeline_neuron_net()  
Pipeline(...)
imsegm.classification.down_sample_dict_features_kmean(dict_features, nb_samples)[source]

cluster the features with k-means, with the number of clusters equal to the given nb_samples, and return the features which are closest to each cluster center

Parameters
  • dict_features (dict) – {int: [list(float) * nb_features] * nb}

  • nb_samples (int) –

Return dict

{int: [list(float) * nb_features] * nb_samples}

>>> np.random.seed(0)
>>> d_fts = {'a': np.random.random((100, 3))}
>>> d_fts = down_sample_dict_features_kmean(d_fts, 5)
>>> d_fts['a'].shape
(5, 3)
imsegm.classification.down_sample_dict_features_random(dict_features, nb_samples)[source]

browse all label features and take random subset of features to have given nb_samples per class

Parameters
  • dict_features (dict) – {int: [list(float) * nb_features] * nb}

  • nb_samples (int) –

Return dict

{int: [list(float) * nb_features] * nb_samples}

>>> np.random.seed(0)
>>> d_fts = {'a': np.random.random((100, 3))}
>>> d_fts = down_sample_dict_features_random(d_fts, 5)
>>> d_fts['a'].shape
(5, 3)
imsegm.classification.down_sample_dict_features_unique(dict_features)[source]

browse all label features and take unique features

Parameters

dict_features (dict) – {int: [list(float) * nb_features] * nb_samples}

Return dict

{int: [list(float) * nb_features] * nb}

>>> np.random.seed(0)
>>> d_fts = {'a': np.random.random((100, 3))}
>>> d_fts = down_sample_dict_features_unique(d_fts)
>>> d_fts['a'].shape
(100, 3)
imsegm.classification.eval_classif_cross_val_roc(clf_name, classif, features, labels, cross_val, path_out=None, nb_steps=100)[source]

compute mean ROC curve on cross-validation schema

http://scikit-learn.org/0.15/auto_examples/plot_roc_crossval.html

Parameters
  • clf_name (str) – name of selected classifier

  • classif (obj) – sklearn classifier

  • features (ndarray) – features in dimension nb_samples x nb_features

  • labels (list(int)) – annotation for samples

  • cross_val (object) –

  • path_out (str) – path for exporting statistic

  • nb_steps (int) – number of thresholds

Returns

>>> np.random.seed(0)
>>> labels = np.array([0] * 150 + [1] * 100 + [3] * 50)
>>> data = np.tile(labels, (6, 1)).T.astype(float)
>>> data += np.random.random(data.shape)
>>> data.shape
(300, 6)
>>> from sklearn.model_selection import StratifiedKFold
>>> cv = StratifiedKFold(n_splits=5, random_state=0, shuffle=True)
>>> classif = create_classifiers()[DEFAULT_CLASSIF_NAME]
>>> fp_tp, auc = eval_classif_cross_val_roc(DEFAULT_CLASSIF_NAME, classif, data, labels, cv, nb_steps=11)
>>> fp_tp
     FP   TP
0   0.0  0.0
1   0.1  1.0
2   0.2  1.0
3   0.3  1.0
4   0.4  1.0
5   0.5  1.0
6   0.6  1.0
7   0.7  1.0
8   0.8  1.0
9   0.9  1.0
10  1.0  1.0
>>> auc  
0.94...
>>> labels[-50:] -= 1
>>> data[-50:, :] -= 1
>>> path_out = 'temp_eval-cv-roc'
>>> os.mkdir(path_out)
>>> fp_tp, auc = eval_classif_cross_val_roc(
...     DEFAULT_CLASSIF_NAME, classif, data, labels, cv, nb_steps=5, path_out=path_out)
>>> fp_tp
     FP   TP
0  0.00  0.0
1  0.25  1.0
2  0.50  1.0
3  0.75  1.0
4  1.00  1.0
>>> auc
0.875
>>> import shutil
>>> shutil.rmtree(path_out, ignore_errors=True)
imsegm.classification.eval_classif_cross_val_scores(clf_name, classif, features, labels, cross_val=10, path_out=None, scorings=('f1_macro', 'accuracy', 'precision_macro', 'recall_macro'))[source]

compute statistic on cross-validation schema

http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html

Parameters
  • clf_name (str) – name of selected classifier

  • classif (obj) – sklearn classifier

  • features (ndarray) – features in dimension nb_samples x nb_features

  • labels (list(int)) – annotation for samples

  • cross_val (object) –

  • path_out (str) – path for exporting statistic

  • scorings (list(str)) – list of used scorings

Return DF

>>> labels = np.array([0] * 150 + [1] * 100 + [2] * 50)
>>> data = np.tile(labels, (6, 1)).T.astype(float)
>>> data += 0.5 - np.random.random(data.shape)
>>> data.shape
(300, 6)
>>> from sklearn.model_selection import StratifiedKFold
>>> cv = StratifiedKFold(n_splits=5, random_state=0, shuffle=True)
>>> classif = create_classifiers()[DEFAULT_CLASSIF_NAME]
>>> df = eval_classif_cross_val_scores(DEFAULT_CLASSIF_NAME, classif, data, labels, cv)
>>> df.round(decimals=1)
   f1_macro  accuracy  precision_macro  recall_macro
0       1.0       1.0              1.0           1.0
1       1.0       1.0              1.0           1.0
2       1.0       1.0              1.0           1.0
3       1.0       1.0              1.0           1.0
4       1.0       1.0              1.0           1.0
>>> labels[labels == 1] = 2
>>> cv = StratifiedKFold(n_splits=3, random_state=0, shuffle=True)
>>> df = eval_classif_cross_val_scores(DEFAULT_CLASSIF_NAME, classif, data, labels, cv, path_out='.')
>>> df.round(decimals=1)
   f1_macro  accuracy  precision_macro  recall_macro
0       1.0       1.0              1.0           1.0
1       1.0       1.0              1.0           1.0
2       1.0       1.0              1.0           1.0
>>> import glob
>>> p_files = glob.glob(NAME_CSV_CLASSIF_CV_SCORES.replace('{}', '*'))
>>> sorted(p_files)  
['classif_RandForest_cross-val_scores-all-folds.csv',
 'classif_RandForest_cross-val_scores-statistic.csv']
>>> [os.remove(p) for p in p_files]  
[...]

do the final testing and save all results

Parameters
  • path_out (str) – path to directory for exporting classifier

  • clf_name (str) – name of selected classifier

  • clf_search (object) –

imsegm.classification.feature_scoring_selection(features, labels, names=None, path_out='')[source]

find the best features and return their indexes

http://scikit-learn.org/stable/auto_examples/linear_model/plot_sparse_recovery.html

http://scikit-learn.org/stable/auto_examples/feature_selection/plot_feature_selection.html

Parameters
  • features (ndarray) – np.array<nb_samples, nb_features>

  • labels (ndarray) – np.array<nb_samples, 1>

  • names (list(str)) –

  • path_out (str) –

Return tuple(list(int),DF)

indices, Dataframe with scoring

>>> from sklearn.datasets import make_classification
>>> features, labels = make_classification(n_samples=250, n_features=5,
...                                        n_informative=3, n_redundant=0,
...                                        n_repeated=0, n_classes=2,
...                                        random_state=0, shuffle=False)
>>> indices, df_scoring = feature_scoring_selection(features, labels)
>>> indices
array([1, 0, 2, 3, 4])
>>> df_scoring  
         ExtTree    F-test    k-Best variance
feature
1        0.24...   0.75...   0.75...  2.49...
2        0.33...  58.94...  58.94...  1.85...
3        0.22...   2.24...   2.24...  1.54...
4        0.10...   4.02...   4.02...  0.96...
5        0.09...   0.02...   0.02...  1.01...
>>> features[:, 2] = 1
>>> path_out = 'test_fts-select'
>>> os.mkdir(path_out)
>>> indices, df_scoring = feature_scoring_selection(features.tolist(), labels.tolist(), path_out=path_out)
>>> indices
array([1, 0, 3, 4, 2])
>>> import shutil
>>> shutil.rmtree(path_out, ignore_errors=True)
imsegm.classification.load_classifier(path_classif)[source]

load the exported classifier together with its metadata

Parameters

path_classif (str) – path to the exported classifier

Return dict

>>> load_classifier('none.abc')
imsegm.classification.relabel_sequential(labels, uq_labels=None)[source]

relabel a vector sequentially, starting from 0

Parameters
  • labels (list(int)) – all labels

  • uq_labels (list(int)) – unique labels

Return list(int)

>>> relabel_sequential([0, 0, 0, 5, 5, 5, 0, 5])
[0, 0, 0, 1, 1, 1, 0, 1]
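
Sequential relabeling amounts to a lookup from each unique label to its rank. The sketch below is a hypothetical pure-Python version that assumes the unique labels are taken in sorted order when `uq_labels` is not given.

```python
def relabel_sequential_sketch(labels, uq_labels=None):
    """Map labels to 0, 1, 2, ... following the order of the unique labels."""
    if uq_labels is None:
        uq_labels = sorted(set(labels))  # assumption: sorted unique labels
    lut = {lb: i for i, lb in enumerate(uq_labels)}
    return [lut[lb] for lb in labels]


relabeled = relabel_sequential_sketch([0, 0, 0, 5, 5, 5, 0, 5])
print(relabeled)
```
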
imsegm.classification.save_classifier(path_out, classif, clf_name, params, feature_names=None, label_names=None)[source]

export the classifier together with its metadata

Parameters
  • path_out (str) – path for exporting the trained classifier

  • classif – sklearn classif.

  • clf_name (str) – name of selected classifier

  • feature_names (list(str)) – list of string names

  • params (dict) – extra parameters

  • label_names (list(str)) – list of string names of label_names

Return str

>>> clf = create_classifiers()['RandForest']
>>> p_clf = save_classifier('.', clf, 'TESTINNG', {})
>>> p_clf
'./classifier_TESTINNG.pkl'
>>> d_clf = load_classifier(p_clf)
>>> sorted(d_clf.keys())
['clf_pipeline', 'features', 'label_names', 'name', 'params']
>>> d_clf['clf_pipeline']  
RandomForestClassifier(...)
>>> d_clf['name']
'TESTINNG'
>>> os.remove(p_clf)
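
The save and load round trip can be sketched with the standard library. This assumes the export is a pickled dictionary, which the `.pkl` extension and the keys listed above suggest but which this page does not state; `save_bundle` and `load_bundle` are hypothetical names, and a plain dictionary stands in for a trained classifier.

```python
import os
import pickle
import tempfile

TEMPLATE_NAME = 'classifier_{}.pkl'  # same pattern as TEMPLATE_NAME_CLF


def save_bundle(path_out, classif, clf_name, params,
                feature_names=None, label_names=None):
    """Pickle the classifier with its metadata and return the file path."""
    bundle = {'clf_pipeline': classif, 'name': clf_name, 'params': params,
              'features': feature_names, 'label_names': label_names}
    path = os.path.join(path_out, TEMPLATE_NAME.format(clf_name))
    with open(path, 'wb') as fp:
        pickle.dump(bundle, fp)
    return path


def load_bundle(path):
    """Return the stored dictionary, or None if the file does not exist."""
    if not os.path.isfile(path):
        return None
    with open(path, 'rb') as fp:
        return pickle.load(fp)


with tempfile.TemporaryDirectory() as tmp_dir:
    path = save_bundle(tmp_dir, {'stand-in': 'model'}, 'Demo', {})
    loaded = load_bundle(path)
    missing = load_bundle(os.path.join(tmp_dir, 'none.abc'))
print(sorted(loaded.keys()), missing)
```
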
imsegm.classification.search_params_cut_down_max_nb_iter(clf_parameters, nb_iter)[source]

create the parameter list and count the number of possible combinations in case they are limited

Parameters
  • clf_parameters (dict) – dictionary with parameters

  • nb_iter (int) – number of random tries

Return int

>>> clf_params = create_clf_param_search_grid(DEFAULT_CLASSIF_NAME)
>>> search_params_cut_down_max_nb_iter(clf_params, 100)
100
>>> search_params_cut_down_max_nb_iter(clf_params, 1e6)
1450
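
Capping the number of iterations can be sketched as follows, assuming the number of combinations is the product of the lengths of the candidate lists and that a non-list value (for example a distribution) makes the space unlimited. `cap_nb_iter` and the parameter names in `grid` are hypothetical.

```python
from math import prod


def cap_nb_iter(clf_parameters, nb_iter):
    """Limit nb_iter by the number of combinations of a finite grid."""
    if all(isinstance(v, (list, tuple)) for v in clf_parameters.values()):
        nb_combinations = prod(len(v) for v in clf_parameters.values())
        return int(min(nb_iter, nb_combinations))
    # at least one parameter is not a finite list, so nothing to cap
    return int(nb_iter)


grid = {
    'classif__n_estimators': [10, 50, 100],
    'classif__max_depth': [2, 4, 8, 16],
    'classif__criterion': ['gini', 'entropy'],
}
print(cap_nb_iter(grid, 10), cap_nb_iter(grid, 1e6))
```
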
imsegm.classification.shuffle_features_labels(features, labels)[source]

take the set of features and labels and shuffle them together while keeping link between feature and its label

Parameters
  • features (ndarray) – features in dimension nb_samples x nb_features

  • labels (list(int)) – annotation for samples

Returns

np.array<nb_samples, nb_features>, np.array<nb_samples>

>>> np.random.seed(0)
>>> fts = np.random.random((5, 2))
>>> lbs = np.random.randint(0, 2, 5)
>>> fts_new, lbs_new = shuffle_features_labels(fts, lbs)
>>> np.array_equal(fts, fts_new)
False
>>> np.array_equal(lbs, lbs_new)
False
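
Shuffling features and labels together comes down to applying one shared permutation to both. A hypothetical pure-Python sketch:

```python
import random


def shuffle_together(features, labels, seed=0):
    """Apply the same random permutation to features and labels."""
    order = list(range(len(labels)))
    random.Random(seed).shuffle(order)
    return [features[i] for i in order], [labels[i] for i in order]


fts = [[0.0, 0.1], [1.0, 1.1], [2.0, 2.1], [3.0, 3.1]]
lbs = [0, 1, 2, 3]
fts_new, lbs_new = shuffle_together(fts, lbs)
print(fts_new, lbs_new)
```
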
imsegm.classification.unique_rows(data)[source]

detect the unique rows of a matrix and return only them

Parameters

data (ndarray) – np.array

Return ndarray

np.array

imsegm.classification.DEFAULT_CLASSIF_NAME = 'RandForest'[source]

default (recommended) classifier for supervised segmentation

imsegm.classification.DEFAULT_CLUSTERING = 'kMeans'[source]

default (recommended) clustering for unsupervised segmentation

imsegm.classification.DICT_SCORING = {'accuracy': sklearn.metrics.accuracy_score, 'f1': sklearn.metrics.f1_score, 'precision': sklearn.metrics.precision_score, 'recall': sklearn.metrics.recall_score}[source]

mapping of metrics names to used functions

imsegm.classification.METRIC_AVERAGES = ('macro', 'weighted')[source]

default types of computed metrics

imsegm.classification.METRIC_SCORING = ('f1_macro', 'accuracy', 'precision_macro', 'recall_macro')[source]

default computed metrics

imsegm.classification.NAME_CSV_CLASSIF_CV_ROC = 'classif_{}_cross-val_ROC-{}.csv'[source]

exporting partial results about trained classifier - Receiver Operating Characteristics

imsegm.classification.NAME_CSV_CLASSIF_CV_SCORES = 'classif_{}_cross-val_scores-{}.csv'[source]

exporting partial results about trained classifier

imsegm.classification.NAME_CSV_FEATURES_SELECT = 'feature_selection.csv'[source]

file name of exported evaluation on feature quality

imsegm.classification.NAME_TXT_CLASSIF_CV_AUC = 'classif_{}_cross-val_AUC-{}.txt'[source]

exporting partial results about trained classifier - Area Under Curve

imsegm.classification.NB_WORKERS_SERACH = 1[source]

default number of workers

imsegm.classification.ROUND_UNIQUE_FTS_DIGITS = 3[source]

number of digits for rounding unique features, in case of too detailed precision

imsegm.classification.TEMPLATE_NAME_CLF = 'classifier_{}.pkl'[source]

name template for exporting trained classifier (adding classifier name and version)
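
The name templates are ordinary `str.format` patterns. The snippet below copies two of the constants from this page and fills them in; the suffix `'all-folds'` is taken from the file names shown in the `eval_classif_cross_val_scores` example.

```python
NAME_CSV_CLASSIF_CV_SCORES = 'classif_{}_cross-val_scores-{}.csv'
TEMPLATE_NAME_CLF = 'classifier_{}.pkl'

# fill in the classifier name and the kind of exported table
scores_file = NAME_CSV_CLASSIF_CV_SCORES.format('RandForest', 'all-folds')
clf_file = TEMPLATE_NAME_CLF.format('RandForest')
# replacing the placeholders with wildcards gives a glob pattern
pattern = NAME_CSV_CLASSIF_CV_SCORES.replace('{}', '*')
print(scores_file, clf_file, pattern)
```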