modbot.training.models.TorchModel

class modbot.training.models.TorchModel(checkpoint=None, embed_size=None, lr=None, device='gpu')

Bases: NNmodel

Base torch model

Methods

`append_history`(history)	Append history if continuing training or define history attribute
`batch_update`(batch, train_loss, scheduler)	Go through single batch pass and update loop
`build_layers`(embed_size)	Build model layers
`chunk_words`(text)	Create char ngrams from word
`clean_text`(text)	Clean single text so it is utf-8 compliant
`clean_texts`(X)	Clean texts so they are utf-8 compliant
`construct_vocab`(encoder, texts)	Construct vocab with encoder
`continue_training`(data_file, config)	Continue training.
`detailed_score`([test_gen, n_matches, out_dir])	Score model and print confusion matrix and multiple other metrics
`eval_score`(df_scores)	Evaluation score to determine whether to save model
`evaluate`(dev_dataloader, epoch, loss_fn)	Evaluate model on test data
`fit`(train_gen, test_gen[, epochs])	Fit model
`get_class_info`()	Get info about class numbers
`get_data_generators`(data_file, **kwargs)	Get data generators with correct sizes
`get_most_common_ngrams`(vocab[, n_min, n_max])	Get most common of each ngram
`get_ngrams`(text, n_min, n_max)	Create char n grams from text
`get_vocab_difference`(texts)	Get difference in direct vocab and adapted vocab
`get_vocab_direct`(texts[, n_min, n_max, ...])	Get vocab direct from texts
`load`(inpath, **kwargs)	Load pytorch model
`load_checkpoint`(checkpoint)	Load model from checkpoint
`load_data`(data_file)	Load data from csv file
`loss_fn`(preds, truth)	Loss function
`model_test`()	Test model on some key phrases
`predict`(X[, verbose])	Predict classification
`predict_one`(X[, verbose])	Predict probability of label=1
`predict_proba`(X[, verbose, batch_size])	Make prediction on input texts
`predict_zero`(X[, verbose])	Predict probability of label=0
`print_eval`(accr)	Log evaluation
`run`(data_file, config)	Run model pipeline.
`save`(outpath)	Save model
`save_check`(test_gen, model_path)	Evaluate model and check if eval merits saving
`save_params`(outpath, kwargs)	Save params to model path
`score`(X, Y)	Score model against targets
`split_data`(df[, test_split])	Split data into training and test sets
`split_words`(text)	Split text into words
`standardize_grams`(grams)	Standardize vocab.
`train`(train_gen, test_gen, **kwargs)	Train pytorch model
`transform`(X[, Y, batch_size])	Return dataloader for input to training or prediction methods

Attributes

`DLOADER_ARGS`
`EMBEDDING_DIM`	Embedding dimension size for embedding layer
`EMBED_SIZE`
`LEARNING_RATE`
`LOSS_FUNCTION`
`MAX_NB_WORDS`	Max vocab size for tokenizer
`MAX_SEQUENCE_LENGTH`	Max sequence length for tokenizer output
`SEED`

EMBEDDING_DIM = 100: Embedding dimension size for embedding layer

MAX_NB_WORDS = 50000: Max vocab size for tokenizer

MAX_SEQUENCE_LENGTH = 64: Max sequence length for tokenizer output

append_history(history)

Append history if continuing training or define history attribute

Parameters: history (dict) – Dictionary containing training history

batch_update(batch, train_loss, scheduler): Go through single batch pass and update loop

abstract build_layers(embed_size): Build model layers

static chunk_words(text)

Create char ngrams from word

Parameters: text (str) – Text string for which to compute ngrams
Returns: grams – Dictionary of grams with values as gram count
Return type: dict

static clean_text(text)

Clean single text so it is utf-8 compliant

Parameters: text (str) – Text string to clean
Returns: Cleaned text string
Return type: str
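
One common way to make a string UTF-8 compliant is an encode/decode round trip that discards anything that cannot be encoded. The sketch below illustrates that idea only; `clean_text_sketch` is a hypothetical helper, and the documentation above does not confirm that `clean_text` works this way.

```python
def clean_text_sketch(text: str) -> str:
    """Drop characters that cannot be encoded as UTF-8 (hypothetical helper)."""
    # Lone surrogates such as "\ud800" are not valid in UTF-8; errors="ignore"
    # discards them instead of raising UnicodeEncodeError.
    return text.encode("utf-8", errors="ignore").decode("utf-8")


# Assumed example input: a string containing one unencodable surrogate.
cleaned = clean_text_sketch("spam\ud800 message")
```

Valid non-ASCII characters pass through unchanged under this approach; only unencodable code points are removed.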

classmethod clean_texts(X)

Clean texts so they are utf-8 compliant

Parameters: X (pd.DataFrame) – Pandas dataframe of texts
Returns: X – Pandas dataframe of cleaned texts
Return type: pd.DataFrame

static construct_vocab(encoder, texts)

Construct vocab with encoder

Parameters

encoder (layers.TextVectorization) – TextVectorization layer used to build network
texts (list | ndarray | pd.DataFrame) – Set of texts used to build vocabulary

Returns

encoder – TextVectorization layer used to build network which has been adapted to get vocab

Return type

layers.TextVectorization

classmethod continue_training(data_file, config)

Continue training. Load model, load data, tokenize texts, and train.

Parameters

data_file (str) – Path to csv file storing texts and labels
config (RunConfig) – Config class with kwargs

Returns

Trained and evaluated sequential model

Return type

keras.Sequential

detailed_score(test_gen=None, n_matches=10, out_dir=None)

Score model and print confusion matrix and multiple other metrics

Parameters

test_gen (WeightedGenerator) – generator for test data
n_matches (int) – Number of positive matches to print
out_dir (str | None) – Path to save scores

Returns

df_scores – A dataframe containing all model scores

Return type

pd.DataFrame

static eval_score(df_scores): Evaluation score to determine whether to save model

evaluate(dev_dataloader, epoch, loss_fn): Evaluate model on test data

fit(train_gen, test_gen, epochs=5)

Fit model

Parameters

train_gen (WeightedGenerator) – WeightedGenerator instance used for training batches
test_gen (WeightedGenerator) – WeightedGenerator instance used for evaluation batches
epochs (int) – Number of epochs to train model

Returns

Dictionary of training history

Return type

dict

get_class_info(): Get info about class numbers

classmethod get_data_generators(data_file, **kwargs)

Get data generators with correct sizes

Parameters

data_file (str) – Path to csv file storing texts and labels
kwargs (dict) – Dictionary with optional keyword parameters. Can include sample_size, batch_size, epochs, n_batches.

Returns

train_gen (WeightedGenerator) – WeightedGenerator instance used for training batches
test_gen (WeightedGenerator) – WeightedGenerator instance used for evaluation batches

static get_most_common_ngrams(vocab, n_min=None, n_max=None)

Get most common of each ngram

Parameters

vocab (list) – List of ngrams which have been built previously
n_min (int | None) – Minimum size of ngram used when building vocab
n_max (int | None) – Maximum size of ngram used when building vocab

static get_ngrams(text, n_min, n_max)

Create char n grams from text

Parameters

text (str) – Text string for which to compute ngrams
n_min (int | None) – Minimum size of ngram used when building vocab
n_max (int | None) – Maximum size of ngram used when building vocab

Returns

grams – Dictionary of grams with values as gram count

Return type

dict
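
To make the "dictionary of grams with values as gram count" concrete, here is a minimal sketch of character n-gram counting. It is not modbot's implementation: `char_ngrams` is a hypothetical name, it assumes `n_min` and `n_max` are integers (the `None` case is not handled), and it assumes the upper bound is inclusive.

```python
from collections import Counter


def char_ngrams(text: str, n_min: int, n_max: int) -> dict:
    """Count every character n-gram of length n_min..n_max (assumed inclusive)."""
    counts = Counter()
    for n in range(n_min, n_max + 1):
        # Slide a window of width n across the text, one character at a time.
        for start in range(len(text) - n + 1):
            counts[text[start:start + n]] += 1
    return dict(counts)


grams = char_ngrams("banana", 2, 3)
# grams maps each 2- and 3-character substring to its number of occurrences,
# for example "an" occurs twice and "ban" once.
```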

get_vocab_difference(texts)

Get difference in direct vocab and adapted vocab

Parameters: texts (list | ndarray | pd.DataFrame) – Set of texts used to build vocabulary

classmethod get_vocab_direct(texts, n_min=None, n_max=None, chunk_words=False)

Get vocab direct from texts

Parameters

texts (list | ndarray | pd.DataFrame) – List of texts for which to compute vocab
n_min (int | None) – Minimum size of ngram used when building vocab
n_max (int | None) – Maximum size of ngram used when building vocab
chunk_words (bool) – Whether to compute ngrams on individual words

Returns: List of words or ngrams
Return type: list

classmethod load(inpath, **kwargs): Load pytorch model

load_checkpoint(checkpoint): Load model from checkpoint

classmethod load_data(data_file)

Load data from csv file

Parameters: data_file (str) – Path to csv file storing texts and labels
Returns: df – Pandas dataframe of texts and labels
Return type: pd.DataFrame

loss_fn(preds, truth): Loss function

model_test()

Test model on some key phrases

Parameters: model (ModerationModel) –

predict(X, verbose=False)

Predict classification

Parameters

X (ndarray | list | pd.DataFrame) – Set of texts to classify
verbose (bool) – Whether to show progress bar for predictions

Returns

List of predicted classifications for input texts

Return type

list

predict_one(X, verbose=False)

Predict probability of label=1

Parameters

X (ndarray | list | pd.DataFrame) – Set of texts to classify
verbose (bool) – Whether to show progress bar for predictions

Returns

List of predicted probability for label=1 for input texts

Return type

list

predict_proba(X, verbose=False, batch_size=128): Make prediction on input texts

predict_zero(X, verbose=False)

Predict probability of label=0

Parameters

X (ndarray | list | pd.DataFrame) – Set of texts to classify
verbose (bool) – Whether to show progress bar for predictions

Returns

List of predicted probability for label=0 for input texts

Return type

list
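
The three prediction helpers are easier to reason about side by side. The sketch below assumes a binary classifier whose two class probabilities sum to 1 and a 0.5 decision threshold; neither assumption is stated in this reference, and both function names are hypothetical.

```python
def complement_probabilities(p_one):
    """Given P(label=1) per text, return P(label=0), assuming two classes."""
    return [1.0 - p for p in p_one]


def classify(p_one, threshold=0.5):
    """Turn P(label=1) into hard 0/1 labels using an assumed threshold."""
    return [1 if p >= threshold else 0 for p in p_one]


# Made-up probabilities for three texts, for illustration only.
p_one = [0.9, 0.2, 0.5]
p_zero = complement_probabilities(p_one)
labels = classify(p_one)
```

Under these assumptions, `predict_zero` and `predict_one` would carry the same information, and `predict` would be a thresholded view of either.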

static print_eval(accr)

Log evaluation

Parameters: accr (list) – List of evaluation results

classmethod run(data_file, config)

Run model pipeline. Load data, tokenize texts, and train

Parameters

data_file (str) – Path to csv file storing texts and labels
config (RunConfig) – Config class with kwargs

Returns

Trained and evaluated keras model or svm

Return type

ModerationModel

save(outpath): Save model

save_check(test_gen, model_path): Evaluate model and check if eval merits saving

static save_params(outpath, kwargs)

Save params to model path

Parameters

outpath (str) – Path to model
kwargs (dict) – Dictionary of kwargs used to build model

score(X, Y)

Score model against targets

Parameters

X (pd.DataFrame) – Pandas dataframe of texts
Y (pd.DataFrame) – Pandas dataframe of labels for the corresponding texts

Returns

Accuracy calculated by comparing the predictions against the ground-truth labels

Return type

float
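
Accuracy here is the fraction of predictions that match their target. A minimal sketch of that calculation, using plain lists in place of the dataframes and a hypothetical `accuracy` helper rather than the method itself:

```python
def accuracy(predictions, targets):
    """Fraction of predictions equal to the corresponding target label."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        raise ValueError("cannot score an empty set of targets")
    correct = sum(p == t for p, t in zip(predictions, targets))
    return correct / len(targets)


# Three of the four made-up predictions match their targets.
acc = accuracy([1, 0, 1, 1], [1, 0, 0, 1])
```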

classmethod split_data(df, test_split=0.1)

Split data into training and test sets

Parameters

df (pd.DataFrame) – Pandas dataframe of texts and labels
test_split (float) – Fraction of full dataset to use for test data

Returns

df_train (pd.DataFrame) – Pandas dataframe of texts and labels for training
df_test (pd.DataFrame) – Pandas dataframe of texts and labels for testing

static split_words(text)

Split text into words

Parameters: text (str) – Text string for which to compute ngrams
Returns: grams – Dictionary of grams with values as gram count
Return type: dict

static standardize_grams(grams)

Standardize vocab. Lowercase and remove punctuation

Parameters: grams (dict) – Dictionary of grams with values as gram count
Returns: clean_grams – Dictionary of standardized grams with values as gram count
Return type: dict
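
A sketch of what lowercasing and stripping punctuation from a gram dictionary could look like. How the real method treats grams that collide after cleaning, or that become empty, is not documented here; this version assumes their counts are merged and that empty grams are dropped. `standardize_grams_sketch` is a hypothetical name.

```python
import string
from collections import Counter

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def standardize_grams_sketch(grams: dict) -> dict:
    """Lowercase each gram, strip ASCII punctuation, and merge counts."""
    clean = Counter()
    for gram, count in grams.items():
        key = gram.lower().translate(_PUNCT_TABLE)
        if key:  # assumption: grams that were pure punctuation are dropped
            clean[key] += count
    return dict(clean)


clean_grams = standardize_grams_sketch({"Hi!": 2, "hi": 3, "?!": 1})
```

Note that `string.punctuation` covers ASCII punctuation only, so characters such as curly quotes would survive this sketch.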

train(train_gen, test_gen, **kwargs): Train pytorch model

transform(X, Y=None, batch_size=None): Return dataloader for input to training or prediction methods