modbot.training.models.BertLSTM

class modbot.training.models.BertLSTM(texts=None, clf=None, vec=None, **kwargs)

Bases: BERT

BERT model with LSTM

Methods

append_history(history)

Append history if continuing training or define history attribute

build_layers(**kwargs)

Build model layers

chunk_words(text)

Create char ngrams from word

clean_text(text)

Clean single text so it is utf-8 compliant

clean_texts(X)

Clean texts so they are utf-8 compliant

construct_vocab(encoder, texts)

Construct vocab with encoder

continue_training(data_file, config)

Continue training.

detailed_score([test_gen, n_matches, out_dir])

Score model and print confusion matrix and multiple other metrics

evaluate(test_gen)

Evaluate model

fit(train_gen, test_gen[, epochs])

Fit model

get_class_info()

Get info about class numbers

get_data_generators(data_file, **kwargs)

Get data generators with correct sizes

get_most_common_ngrams(vocab[, n_min, n_max])

Get most common of each ngram

get_ngrams(text, n_min, n_max)

Create character n-grams from text

get_vocab_difference(texts)

Get difference in direct vocab and adapted vocab

get_vocab_direct(texts[, n_min, n_max, ...])

Get vocab direct from texts

init_bert(**kwargs)

Initialize BERT

load(inpath, **kwargs)

Load model

load_data(data_file)

Load data from csv file

model_test()

Test model on some key phrases

predict(X[, verbose])

Predict classification

predict_one(X[, verbose])

Predict probability of label=1

predict_proba(X[, verbose])

Predict probability

predict_zero(X[, verbose])

Predict probability of label=0

print_eval(accr)

Log evaluation

run(data_file, config)

Run model pipeline.

save(outpath)

Save model

save_params(outpath, kwargs)

Save params to model path

score(X, Y)

Score model against targets

split_data(df[, test_split])

Split data into training and test sets

split_words(text)

Split text into words

standardize_grams(grams)

Standardize vocab.

train(train_gen, test_gen, **kwargs)

Train model and evaluate

transform(X)

Transform texts

Attributes

EMBEDDING_DIM

Embedding dimension size for embedding layer

MAX_NB_WORDS

Max vocab size for tokenizer

MAX_SEQUENCE_LENGTH

Max sequence length for tokenizer output

EMBEDDING_DIM = 100

Embedding dimension size for embedding layer

MAX_NB_WORDS = 50000

Max vocab size for tokenizer

MAX_SEQUENCE_LENGTH = 250

Max sequence length for tokenizer output

append_history(history)

Append history if continuing training or define history attribute

Parameters

history (dict) – Dictionary containing training history
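
The append-or-define behavior described above can be sketched as follows. This is a minimal illustration, not modbot's implementation: it assumes the history is a dict mapping metric names to per-epoch lists (the shape of Keras `History.history`), and the helper name `merge_history` is hypothetical.

```python
def merge_history(existing, new):
    """Extend per-metric lists in `existing` with the values from `new`.

    When `existing` is None (first training run), a copy of `new` is returned,
    which corresponds to "define history attribute".
    """
    if existing is None:
        return {key: list(values) for key, values in new.items()}
    # Copy so the caller's dict is not mutated.
    merged = {key: list(values) for key, values in existing.items()}
    for key, values in new.items():
        merged.setdefault(key, []).extend(values)
    return merged


# First run defines the history; a continued run appends to it.
first = merge_history(None, {"loss": [0.9, 0.7]})
continued = merge_history(first, {"loss": [0.6], "accuracy": [0.8]})
```

Metrics that appear only in the later run (here `accuracy`) simply start a new list; whether modbot handles that case the same way is not stated in this reference.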

build_layers(**kwargs)

Build model layers

Parameters

kwargs (dict) – Dictionary of config parameters

Returns

Keras model

Return type

keras.Model

static chunk_words(text)

Create char ngrams from word

Parameters

text (str) – Text string for which to compute ngrams

Returns

grams – Dictionary of grams with values as gram count

Return type

dict

static clean_text(text)

Clean single text so it is utf-8 compliant

Parameters

text (str) – Text string to clean

Returns

Cleaned text string

Return type

str
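
One common way to make a string utf-8 compliant is an encode/decode round trip that drops anything that cannot be encoded. The sketch below shows that approach only as an assumption about what "utf-8 compliant" means here; the function name `to_utf8_compliant` is hypothetical and the actual cleaning rules in modbot may differ.

```python
def to_utf8_compliant(text):
    """Drop code points that cannot be encoded as UTF-8 (e.g. lone surrogates)."""
    return text.encode("utf-8", errors="ignore").decode("utf-8")


# A lone surrogate (\ud800) is not encodable and is removed;
# ordinary non-ASCII characters such as "é" are kept.
cleaned = to_utf8_compliant("hello \ud800world")
kept = to_utf8_compliant("café")
```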

classmethod clean_texts(X)

Clean texts so they are utf-8 compliant

Parameters

X (pd.DataFrame) – Pandas dataframe of texts

Returns

X – Pandas dataframe of cleaned texts

Return type

pd.DataFrame

static construct_vocab(encoder, texts)

Construct vocab with encoder

Parameters
  • encoder (layers.TextVectorization) – TextVectorization layer used to build network

  • texts (list | ndarray | pd.DataFrame) – Set of texts used to build vocabulary

Returns

encoder – TextVectorization layer used to build network which has been adapted to get vocab

Return type

layers.TextVectorization

classmethod continue_training(data_file, config)

Continue training. Load model, load data, tokenize texts, and train.

Parameters
  • data_file (str) – Path to csv file storing texts and labels

  • config (RunConfig) – Config class with kwargs

Returns

Trained and evaluated sequential model

Return type

keras.Sequential

detailed_score(test_gen=None, n_matches=10, out_dir=None)

Score model and print confusion matrix and multiple other metrics

Parameters
  • test_gen (WeightedGenerator) – generator for test data

  • n_matches (int) – Number of positive matches to print

  • out_dir (str | None) – Path to save scores

Returns

df_scores – A dataframe containing all model scores

Return type

pd.DataFrame

evaluate(test_gen)

Evaluate model

Parameters

test_gen (WeightedGenerator) – WeightedGenerator instance used for model evaluation

Returns

List of evaluation results

Return type

list

fit(train_gen, test_gen, epochs=5)

Fit model

Parameters
  • train_gen (WeightedGenerator) – WeightedGenerator instance used for training batches

  • test_gen (WeightedGenerator) – WeightedGenerator instance used for evaluation batches

  • epochs (int) – Number of epochs to train model

Returns

Dictionary of training history

Return type

dict

get_class_info()

Get info about class numbers

classmethod get_data_generators(data_file, **kwargs)

Get data generators with correct sizes

Parameters
  • data_file (str) – Path to csv file storing texts and labels

  • kwargs (dict) – Dictionary with optional keyword parameters. Can include sample_size, batch_size, epochs, n_batches.

Returns

  • train_gen (WeightedGenerator) – WeightedGenerator instance used for training batches

  • test_gen (WeightedGenerator) – WeightedGenerator instance used for evaluation batches

static get_most_common_ngrams(vocab, n_min=None, n_max=None)

Get most common of each ngram

Parameters
  • vocab (list) – List of ngrams which have been built previously

  • n_min (int | None) – Minimum size of ngram used when building vocab

  • n_max (int | None) – Maximum size of ngram used when building vocab

static get_ngrams(text, n_min, n_max)

Create character n-grams from text

Parameters
  • text (str) – Text string for which to compute ngrams

  • n_min (int | None) – Minimum size of ngram used when building vocab

  • n_max (int | None) – Maximum size of ngram used when building vocab

Returns

grams – Dictionary of grams with values as gram count

Return type

dict
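
To make the returned structure concrete, here is a minimal sketch of counting character n-grams into a dictionary. It is not modbot's code: the name `char_ngrams` is hypothetical, and treating `n_max` as inclusive is an assumption.

```python
from collections import Counter


def char_ngrams(text, n_min, n_max):
    """Count every character n-gram with n_min <= n <= n_max (inclusive)."""
    counts = Counter()
    for n in range(n_min, n_max + 1):
        # Slide a window of width n across the text.
        for start in range(len(text) - n + 1):
            counts[text[start:start + n]] += 1
    return dict(counts)


grams = char_ngrams("abab", 2, 3)
```

For the input `"abab"` the 2-gram `"ab"` occurs twice, while `"ba"`, `"aba"`, and `"bab"` each occur once.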

get_vocab_difference(texts)

Get difference in direct vocab and adapted vocab

Parameters

texts (list | ndarray | pd.DataFrame) – Set of texts used to build vocabulary

classmethod get_vocab_direct(texts, n_min=None, n_max=None, chunk_words=False)

Get vocab direct from texts

Parameters
  • texts (list | ndarray | pd.DataFrame) – List of texts for which to compute vocab

  • n_min (int | None) – Minimum size of ngram used when building vocab

  • n_max (int | None) – Maximum size of ngram used when building vocab

  • chunk_words (bool) – Whether to compute ngrams on individual words

Returns

List of words or ngrams

Return type

list

init_bert(**kwargs)

Initialize BERT

Parameters

kwargs (dict) – Dictionary of config parameters

Returns

  • bert_preprocess (hub.KerasLayer) – BERT tokenizer to be used in network

  • bert_encoder (hub.KerasLayer) – BERT encoder to be used in network

classmethod load(inpath, **kwargs)

Load model

Parameters

inpath (str) – Path from which to load model

Returns

Initialized NNmodel model

classmethod load_data(data_file)

Load data from csv file

Parameters

data_file (str) – Path to csv file storing texts and labels

Returns

df – Pandas dataframe of texts and labels

Return type

pd.DataFrame

model_test()

Test model on some key phrases

Parameters

model (ModerationModel) –

predict(X, verbose=False)

Predict classification

Parameters
  • X (ndarray | list | pd.DataFrame) – Set of texts to classify

  • verbose (bool) – Whether to show progress bar for predictions

Returns

List of predicted classifications for input texts

Return type

list

predict_one(X, verbose=False)

Predict probability of label=1

Parameters
  • X (ndarray | list | pd.DataFrame) – Set of texts to classify

  • verbose (bool) – Whether to show progress bar for predictions

Returns

List of predicted probability for label=1 for input texts

Return type

list

predict_proba(X, verbose=False)

Predict probability

Parameters
  • X (ndarray | list | pd.DataFrame) – Set of texts to classify

  • verbose (bool) – Whether to show progress bar for predictions

Returns

List of probabilities of having label=1 for input texts

Return type

list

predict_zero(X, verbose=False)

Predict probability of label=0

Parameters
  • X (ndarray | list | pd.DataFrame) – Set of texts to classify

  • verbose (bool) – Whether to show progress bar for predictions

Returns

List of predicted probability for label=0 for input texts

Return type

list
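
The four prediction methods relate to each other in a simple way if the model is a binary classifier with a single probability output. The sketch below illustrates that relationship under exactly that assumption; the function names and the 0.5 decision threshold are hypothetical and are not taken from modbot.

```python
def prob_zero(prob_one):
    """Probability of label=0, assuming two classes: p0 = 1 - p1."""
    return [1.0 - p for p in prob_one]


def classify(prob_one, threshold=0.5):
    """Hard labels from label=1 probabilities (threshold is an assumption)."""
    return [1 if p >= threshold else 0 for p in prob_one]


p_one = [0.25, 0.5, 1.0]   # stand-in for label=1 probabilities
p_zero = prob_zero(p_one)
labels = classify(p_one)
```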

static print_eval(accr)

Log evaluation

Parameters

accr (list) – List of evaluation results

classmethod run(data_file, config)

Run model pipeline. Load data, tokenize texts, and train

Parameters
  • data_file (str) – Path to csv file storing texts and labels

  • config (RunConfig) – Config class with kwargs

Returns

Trained and evaluated Keras model or SVM

Return type

ModerationModel

save(outpath)

Save model

Parameters

outpath (str) – Path to save model

static save_params(outpath, kwargs)

Save params to model path

Parameters
  • outpath (str) – Path to model

  • kwargs (dict) – Dictionary of kwargs used to build model

score(X, Y)

Score model against targets

Parameters
  • X (pd.DataFrame) – Pandas dataframe of texts

  • Y (pd.DataFrame) – Pandas dataframe of labels for the corresponding texts

Returns

Accuracy calculated by comparing the model's predictions against the ground-truth labels

Return type

float
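
Accuracy here is the fraction of predictions that match the targets. A minimal sketch of that calculation, using plain lists in place of dataframes and a hypothetical helper name `accuracy`:

```python
def accuracy(predictions, targets):
    """Fraction of predictions equal to the corresponding target label."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        raise ValueError("cannot score an empty set")
    correct = sum(p == t for p, t in zip(predictions, targets))
    return correct / len(targets)


# Three of the four predictions match the targets.
acc = accuracy([1, 0, 1, 1], [1, 0, 0, 1])
```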

classmethod split_data(df, test_split=0.1)

Split data into training and test sets

Parameters
  • df (pd.DataFrame) – Pandas dataframe of texts and labels

  • test_split (float) – Fraction of full dataset to use for test data

Returns

  • df_train (pd.DataFrame) – Pandas dataframe of texts and labels for training

  • df_test (pd.DataFrame) – Pandas dataframe of texts and labels for testing
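
The role of `test_split` can be sketched with the standard library alone. This is an illustration under assumptions, not modbot's implementation: plain lists of `(text, label)` rows stand in for the dataframe, the rows are shuffled with a fixed seed, and the name `split_rows` is hypothetical. Whether modbot shuffles or stratifies is not stated in this reference.

```python
import random


def split_rows(rows, test_split=0.1, seed=0):
    """Shuffle rows and hold out a `test_split` fraction for testing."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_split))
    # The first n_test shuffled rows become the test set, the rest the training set.
    return shuffled[n_test:], shuffled[:n_test]


rows = [(f"text {i}", i % 2) for i in range(20)]
train_rows, test_rows = split_rows(rows, test_split=0.1)
```

With 20 rows and `test_split=0.1`, two rows are held out and 18 remain for training.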

static split_words(text)

Split text into words

Parameters

text (str) – Text string for which to compute ngrams

Returns

grams – Dictionary of grams with values as gram count

Return type

dict

static standardize_grams(grams)

Standardize vocab. Lowercase and remove punctuation

Parameters

grams (dict) – Dictionary of grams with values as gram count

Returns

clean_grams – Dictionary of standardized grams with values as gram count

Return type

dict
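
Lowercasing and stripping punctuation can cause several raw grams to collapse into one key, so their counts need to be combined. The sketch below shows one way to do that; it is an assumption about the intended behavior, with a hypothetical function name `standardize` and ASCII punctuation from `string.punctuation`.

```python
import string

# Translation table that deletes ASCII punctuation characters.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def standardize(grams):
    """Lowercase keys, strip punctuation, and sum counts of merged keys."""
    clean = {}
    for gram, count in grams.items():
        key = gram.lower().translate(_PUNCT_TABLE)
        if key:  # skip grams that were punctuation only
            clean[key] = clean.get(key, 0) + count
    return clean


clean_grams = standardize({"Hello": 2, "hello!": 1, "?!": 4})
```

Here `"Hello"` and `"hello!"` merge into `"hello"` with a count of 3, and the punctuation-only gram is dropped.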

train(train_gen, test_gen, **kwargs)

Train model and evaluate

Parameters
  • train_gen (WeightedGenerator) – WeightedGenerator instance used for training batches

  • test_gen (WeightedGenerator) – WeightedGenerator instance used for evaluation batches

  • kwargs (dict) – Dictionary with optional keyword parameters. Can include sample_size, batch_size, epochs, n_batches.

transform(X)

Transform texts

Parameters

X (list | ndarray | pd.DataFrame) – Set of texts to transform before sending to model

Returns

X – Set of transformed texts ready to send to model

Return type

list | ndarray | pd.DataFrame