modbot.training.models.SVM

class modbot.training.models.SVM(texts=None, model=None, **kwargs)

Bases: ModerationModel

Linear SVM model

Methods

`clean_text`(text)	Clean single text so it is utf-8 compliant
`clean_texts`(X)	Clean texts so they are utf-8 compliant
`continue_training`(data_file, config)	Continue training.
`detailed_score`([test_gen, n_matches, out_dir])	Score model and print confusion matrix and multiple other metrics
`get_class_info`()	Get info about class numbers
`get_data_generators`(data_file, **kwargs)	Get data generators with correct sizes
`load`(inpath, **kwargs)	Load SVM model from path
`load_data`(data_file)	Load data from csv file
`model_test`()	Test model on some key phrases
`predict`(X[, verbose])	Predict classification
`predict_one`(X[, verbose])	Predict probability of label=1
`predict_proba`(X[, verbose])	Predict class probabilities
`predict_zero`(X[, verbose])	Predict probability of label=0
`run`(data_file, config)	Run model pipeline.
`save`(outpath)	Save model
`save_params`(outpath, kwargs)	Save params to model path
`score`(X, Y)	Score model against targets
`split_data`(df[, test_split])	Split data into training and test sets
`train`(train_gen[, test_gen])	Train model
`transform`(X)	Transform texts

Attributes

PARAMS

Default parameters for tf-idf, svm, and calibrated classifier

PARAMS = {'C': 1, 'analyzer': 'char_wb', 'cv': 5, 'max_df': 1.0, 'max_iter': 10000, 'method': 'sigmoid', 'min_df': 1, 'ngram_range': (1, 8), 'smooth_idf': 1, 'stop_words': None, 'sublinear_tf': 1, 'tokenizer': None}
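The `analyzer` and `ngram_range` entries look like scikit-learn `TfidfVectorizer` settings, although this page does not say so explicitly. Under that assumption, `'char_wb'` with `(1, 8)` means character n-grams of length 1 to 8 taken inside word boundaries, with each word padded by one space on each side. The following is a simplified pure-Python sketch of that idea, not modbot's or scikit-learn's actual code; the helper name `char_wb_ngrams` is hypothetical, and real implementations may treat very short words differently.

```python
def char_wb_ngrams(text, ngram_range=(1, 8)):
    """Return character n-grams taken inside space-padded words.

    Hypothetical helper for illustration only; not part of modbot.
    """
    min_n, max_n = ngram_range
    ngrams = []
    for word in text.split():
        padded = f" {word} "  # mark the word boundaries with spaces
        for n in range(min_n, max_n + 1):
            if n > len(padded):
                # The padded word is shorter than n: nothing more to emit.
                break
            for start in range(len(padded) - n + 1):
                ngrams.append(padded[start:start + n])
    return ngrams


# Small range so the result is easy to inspect.
grams = char_wb_ngrams("hi yo", ngram_range=(2, 3))
print(grams)
```

Because n-grams never cross a word boundary, a misspelled or obfuscated word still shares most of its n-grams with the original, which is one plausible reason for choosing this analyzer for moderation text.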

static clean_text(text)

Clean single text so it is utf-8 compliant

Parameters: text (str) – Text string to clean
Returns: Cleaned text string
Return type: str
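The page does not describe how the cleaning is done. One common way to make a string UTF-8 compliant is an encode/decode round trip that drops anything that cannot be encoded, such as unpaired surrogates. The sketch below assumes that approach; `clean_text_sketch` is a hypothetical stand-in, not the modbot implementation.

```python
def clean_text_sketch(text):
    """Drop characters that cannot be encoded as UTF-8.

    Hypothetical stand-in for illustration; modbot's clean_text may differ.
    """
    return text.encode("utf-8", errors="ignore").decode("utf-8")


raw = "good\ud800 text"            # contains an unpaired surrogate
cleaned = clean_text_sketch(raw)
cleaned.encode("utf-8")            # the cleaned string encodes without error
print(repr(cleaned))
```

Ordinary non-ASCII text such as accented letters passes through unchanged, since it is already valid UTF-8.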

classmethod clean_texts(X)

Clean texts so they are utf-8 compliant

Parameters: X (pd.DataFrame) – Pandas dataframe of texts
Returns: X – Pandas dataframe of cleaned texts
Return type: pd.DataFrame

classmethod continue_training(data_file, config)

Continue training. Load model, load data, tokenize texts, and train.

Parameters

data_file (str) – Path to csv file storing texts and labels
config (RunConfig) – Config class with kwargs

Returns

Trained and evaluated sequential model

Return type

keras.Sequential

detailed_score(test_gen=None, n_matches=10, out_dir=None)

Score model and print confusion matrix and multiple other metrics

Parameters

test_gen (WeightedGenerator) – generator for test data
n_matches (int) – Number of positive matches to print
out_dir (str | None) – Path to save scores

Returns

df_scores – A dataframe containing all model scores

Return type

pd.DataFrame
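The page does not list which metrics `detailed_score` reports beyond the confusion matrix. As a rough illustration of what a binary confusion matrix and two metrics derived from it look like, here is a minimal sketch; the function names and the choice of precision and recall are assumptions, not modbot's API.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels 0 and 1."""
    pairs = list(zip(y_true, y_pred))
    return {
        "tp": sum(1 for t, p in pairs if t == 1 and p == 1),
        "tn": sum(1 for t, p in pairs if t == 0 and p == 0),
        "fp": sum(1 for t, p in pairs if t == 0 and p == 1),
        "fn": sum(1 for t, p in pairs if t == 1 and p == 0),
    }


def precision_recall(counts):
    """Derive precision and recall, returning 0.0 when a denominator is zero."""
    predicted_pos = counts["tp"] + counts["fp"]
    actual_pos = counts["tp"] + counts["fn"]
    precision = counts["tp"] / predicted_pos if predicted_pos else 0.0
    recall = counts["tp"] / actual_pos if actual_pos else 0.0
    return precision, recall


y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
counts = confusion_counts(y_true, y_pred)
precision, recall = precision_recall(counts)
print(counts, precision, recall)
```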

get_class_info(): Get info about class numbers

classmethod get_data_generators(data_file, **kwargs)

Get data generators with correct sizes

Parameters

data_file (str) – Path to csv file storing texts and labels
kwargs (dict) – Dictionary with optional keyword parameters. Can include sample_size, batch_size, epochs, n_batches.

Returns

train_gen (WeightedGenerator) – WeightedGenerator instance used for training batches
test_gen (WeightedGenerator) – WeightedGenerator instance used for evaluation batches

classmethod load(inpath, **kwargs)

Load SVM model from path

Parameters: inpath (str) – Path to load model from
Returns: Previously trained and saved model
Return type: SVM

classmethod load_data(data_file)

Load data from csv file

Parameters: data_file (str) – Path to csv file storing texts and labels
Returns: df – Pandas dataframe of texts and labels
Return type: pd.DataFrame

model_test()

Test model on some key phrases

Parameters: model (ModerationModel) –

predict(X, verbose=False)

Predict classification

Parameters

X (ndarray | list | pd.DataFrame) – Set of texts to classify
verbose (bool) – Whether to show progress bar for predictions

Returns

List of predicted classifications for input texts

Return type

list

predict_one(X, verbose=False)

Predict probability of label=1

Parameters

X (ndarray | list | pd.DataFrame) – Set of texts to classify
verbose (bool) – Whether to show progress bar for predictions

Returns

List of predicted probabilities of label=1 for the input texts

Return type

list

predict_proba(X, verbose=False)

Predict class probabilities

Parameters

X (ndarray | list | pd.DataFrame) – Set of texts to classify
verbose (bool) – Has no effect. For compliance with LSTM method

Returns

List of predicted class probabilities for the input texts

Return type

list
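How `predict`, `predict_zero`, and `predict_one` relate to the probability output is not spelled out on this page. The sketch below assumes each row holds `[P(label=0), P(label=1)]` for one text, as scikit-learn classifiers return; that layout and all three helper names are assumptions for illustration.

```python
def predict_zero_from(proba):
    """Pick the probability of label=0 from each [p0, p1] row."""
    return [row[0] for row in proba]


def predict_one_from(proba):
    """Pick the probability of label=1 from each [p0, p1] row."""
    return [row[1] for row in proba]


def predict_from(proba):
    """Pick the most probable label per row (ties go to the lower label)."""
    return [max(range(len(row)), key=row.__getitem__) for row in proba]


# Hypothetical probabilities for three texts.
proba = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
p_zero = predict_zero_from(proba)
p_one = predict_one_from(proba)
labels = predict_from(proba)
print(p_zero, p_one, labels)
```

For a binary model the two columns sum to 1, so either probability determines the other.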

predict_zero(X, verbose=False)

Predict probability of label=0

Parameters

X (ndarray | list | pd.DataFrame) – Set of texts to classify
verbose (bool) – Whether to show progress bar for predictions

Returns

List of predicted probabilities of label=0 for the input texts

Return type

list

classmethod run(data_file, config)

Run model pipeline. Load data, tokenize texts, and train.

Parameters

data_file (str) – Path to csv file storing texts and labels
config (RunConfig) – Config class with kwargs

Returns

Trained and evaluated Keras model or SVM

Return type

ModerationModel

save(outpath)

Save model

Parameters: outpath (str) – Path to save model

static save_params(outpath, kwargs)

Save params to model path

Parameters

outpath (str) – Path to model
kwargs (dict) – Dictionary of kwargs used to build model

score(X, Y)

Score model against targets

Parameters

X (pd.DataFrame) – Pandas dataframe of texts
Y (pd.DataFrame) – Pandas dataframe of labels for the corresponding texts

Returns

Accuracy, calculated as the fraction of predictions that match the ground-truth labels

Return type

float
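Accuracy here is the share of predictions that equal the true labels. A minimal sketch of that calculation on plain lists follows; the function name `accuracy` is hypothetical, and the real method takes dataframes and runs the model's own prediction step first.

```python
def accuracy(y_true, y_pred):
    """Return the fraction of predictions that equal the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot score an empty set")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


acc = accuracy([1, 0, 1, 1], [1, 0, 0, 1])
print(acc)
```

On imbalanced moderation data a high accuracy can hide poor recall on the rare class, so it is worth reading alongside the output of `detailed_score`.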

classmethod split_data(df, test_split=0.1)

Split data into training and test sets

Parameters

df (pd.DataFrame) – Pandas dataframe of texts and labels
test_split (float) – Fraction of full dataset to use for test data

Returns

df_train (pd.DataFrame) – Pandas dataframe of texts and labels for training
df_test (pd.DataFrame) – Pandas dataframe of texts and labels for testing
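Whether `split_data` shuffles or stratifies is not stated on this page. As a rough picture of what holding out a `test_split` fraction means, the sketch below shuffles a list of rows and slices off the test portion; the function name, the use of plain lists instead of a dataframe, and the `seed` parameter are all assumptions for illustration.

```python
import random


def split_rows(rows, test_split=0.1, seed=0):
    """Shuffle rows and hold out a fraction of them for testing.

    Hypothetical helper; `seed` is added here only to make the sketch
    reproducible and is not a documented modbot parameter.
    """
    if not 0.0 < test_split < 1.0:
        raise ValueError("test_split must be between 0 and 1")
    shuffled = list(rows)                  # copy so the input is untouched
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_split))
    return shuffled[n_test:], shuffled[:n_test]


rows = [(f"text {i}", i % 2) for i in range(20)]
train_rows, test_rows = split_rows(rows, test_split=0.1)
print(len(train_rows), len(test_rows))
```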

train(train_gen, test_gen=None, **kwargs)

Train model

Parameters

train_gen (WeightedGenerator) – WeightedGenerator instance used for training batches
test_gen (WeightedGenerator) – Has no effect. For compliance with LSTM train method
kwargs (dict) – Has no effect. For compliance with LSTM train method

transform(X)

Transform texts

Parameters: X (list | ndarray | pd.DataFrame) – Set of texts to transform before sending to model
Returns: X – Set of transformed texts ready to send to model
Return type: list | ndarray | pd.DataFrame