modbot.training.models.BertCnnLstm
- class modbot.training.models.BertCnnLstm(texts=None, clf=None, vec=None, **kwargs)[source]
 Bases:
BERTModel with Bert, CNN, and LSTM in that order
Methods
append_history(history) – Append history if continuing training or define history attribute
build_layers(**kwargs) – Build model layers
chunk_words(text) – Create char ngrams from word
clean_text(text) – Clean single text so it is utf-8 compliant
clean_texts(X) – Clean texts so they are utf-8 compliant
combo_layer(outputs, k_size) – Combination layer
construct_vocab(encoder, texts) – Construct vocab with encoder
continue_training(data_file, config) – Continue training.
detailed_score([test_gen, n_matches, out_dir]) – Score model and print confusion matrix and multiple other metrics
evaluate(test_gen) – Evaluate model
fit(train_gen, test_gen[, epochs]) – Fit model
get_class_info() – Get info about class numbers
get_data_generators(data_file, **kwargs) – Get data generators with correct sizes
get_most_common_ngrams(vocab[, n_min, n_max]) – Get most common of each ngram
get_ngrams(text, n_min, n_max) – Create char n grams from text
get_vocab_difference(texts) – Get difference in direct vocab and adapted vocab
get_vocab_direct(texts[, n_min, n_max, ...]) – Get vocab direct from texts
init_bert(**kwargs) – Initialize BERT
load(inpath, **kwargs) – Load model
load_data(data_file) – Load data from csv file
model_test() – Test model on some key phrases
predict(X[, verbose]) – Predict classification
predict_one(X[, verbose]) – Predict probability of label=1
predict_proba(X[, verbose]) – Predict probability
predict_zero(X[, verbose]) – Predict probability of label=0
print_eval(accr) – Log evaluation
run(data_file, config) – Run model pipeline.
save(outpath) – Save model
save_params(outpath, kwargs) – Save params to model path
score(X, Y) – Score model against targets
split_data(df[, test_split]) – Split data into training and test sets
split_words(text) – Split text into words
standardize_grams(grams) – Standardize vocab.
train(train_gen, test_gen, **kwargs) – Train model and evaluate
transform(X) – Transform texts
Attributes
EMBEDDING_DIM – Embedding dimension size for embedding layer
MAX_NB_WORDS – Max vocab size for tokenizer
MAX_SEQUENCE_LENGTH – Max sequence length for tokenizer output
- EMBEDDING_DIM = 100
 Embedding dimension size for embedding layer
- MAX_NB_WORDS = 50000
 Max vocab size for tokenizer
- MAX_SEQUENCE_LENGTH = 250
 Max sequence length for tokenizer output
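A fixed maximum sequence length usually means that longer token sequences are clipped and shorter ones are padded. The sketch below illustrates that idea in plain Python. It is not taken from the library: `pad_or_truncate` is a hypothetical helper, and padding at the end with id 0 is an assumption.

```python
MAX_SEQUENCE_LENGTH = 250  # value documented above


def pad_or_truncate(token_ids, max_len=MAX_SEQUENCE_LENGTH, pad_id=0):
    """Return a list of exactly max_len ids (hypothetical helper)."""
    # Keep at most max_len ids from the front of the sequence.
    clipped = list(token_ids[:max_len])
    # Fill the remainder with the padding id (assumed to be 0, appended at the end).
    return clipped + [pad_id] * (max_len - len(clipped))


short = pad_or_truncate([5, 9, 2])
long = pad_or_truncate(list(range(1, 401)))
```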
- append_history(history)
 Append history if continuing training or define history attribute
- Parameters
 history (dict) – Dictionary containing training history
- build_layers(**kwargs)[source]
 Build model layers
- Parameters
 kwargs (dict) – Dictionary of config parameters
- Returns
 Keras model
- Return type
 keras.Model
- static chunk_words(text)
 Create char ngrams from word
- Parameters
 text (str) – Text string for which to compute ngrams
- Returns
 grams – Dictionary of grams with values as gram count
- Return type
 dict
- static clean_text(text)
 Clean single text so it is utf-8 compliant
- Parameters
 text (str) – Text string to clean
- Returns
 Cleaned text string
- Return type
 str
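One common way to make a string UTF-8 compliant is to round-trip it through the codec and drop whatever cannot be encoded. The sketch below shows that general technique only; it is an assumption, not the library's implementation, and `to_utf8_safe` is a hypothetical name.

```python
def to_utf8_safe(text):
    """Drop characters that cannot be encoded as UTF-8 (hypothetical helper)."""
    # Lone surrogates such as "\ud800" are allowed in a Python str
    # but cannot be encoded as UTF-8, so errors="ignore" removes them.
    return text.encode("utf-8", errors="ignore").decode("utf-8")


cleaned = to_utf8_safe("bad\ud800 text")
```

Valid non-ASCII characters pass through unchanged; only unencodable code points are removed.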
- classmethod clean_texts(X)
 Clean texts so they are utf-8 compliant
- Parameters
 X (pd.DataFrame) – Pandas dataframe of texts
- Returns
 X – Pandas dataframe of cleaned texts
- Return type
 pd.DataFrame
- static combo_layer(outputs, k_size)[source]
 Combination layer
- Parameters
 outputs (list) – List of tensors from BERT
k_size (int) – Size of convolution kernel
- Returns
 out – Output of layer
- Return type
 tf.Tensor
- static construct_vocab(encoder, texts)
 Construct vocab with encoder
- Parameters
 encoder (layers.TextVectorization) – TextVectorization layer used to build network
texts (list | ndarray | pd.DataFrame) – Set of texts used to build vocabulary
- Returns
 encoder – TextVectorization layer used to build network which has been adapted to get vocab
- Return type
 layers.TextVectorization
- classmethod continue_training(data_file, config)
 Continue training. Load model, load data, tokenize texts, and train.
- Parameters
 data_file (str) – Path to csv file storing texts and labels
config (RunConfig) – Config class with kwargs
- Returns
 Trained and evaluated sequential model
- Return type
 keras.Sequential
- detailed_score(test_gen=None, n_matches=10, out_dir=None)
 Score model and print confusion matrix and multiple other metrics
- Parameters
 test_gen (WeightedGenerator) – generator for test data
n_matches (int) – Number of positive matches to print
out_dir (str | None) – Path to save scores
- Returns
 df_scores – A dataframe containing all model scores
- Return type
 pd.DataFrame
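For binary labels, the confusion matrix and the metrics derived from it can be computed as in the sketch below. This is a standalone illustration of the kind of scores such a method might report, not the library's code; `binary_confusion` and the choice of metrics are assumptions.

```python
def binary_confusion(y_true, y_pred):
    """Count TP/FP/TN/FN for 0/1 labels and derive precision and recall."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == 1 and truth == 1:
            tp += 1
        elif pred == 1:
            fp += 1
        elif truth == 1:
            fn += 1
        else:
            tn += 1
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"tp": tp, "fp": fp, "tn": tn, "fn": fn,
            "precision": precision, "recall": recall}


scores = binary_confusion([1, 0, 1, 1, 0], [1, 1, 0, 1, 0])
```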
- evaluate(test_gen)
 Evaluate model
- Parameters
 test_gen (WeightedGenerator) – WeightedGenerator instance used for model evaluation
- Returns
 List of evaluation results
- Return type
 list
- fit(train_gen, test_gen, epochs=5)
 Fit model
- Parameters
 train_gen (WeightedGenerator) – WeightedGenerator instance used for training batches
test_gen (WeightedGenerator) – WeightedGenerator instance used for evaluation batches
epochs (int) – Number of epochs to train model
- Returns
 Dictionary of training history
- Return type
 dict
- get_class_info()
 Get info about class numbers
- classmethod get_data_generators(data_file, **kwargs)
 Get data generators with correct sizes
- Parameters
 data_file (str) – Path to csv file storing texts and labels
kwargs (dict) – Dictionary with optional keyword parameters. Can include sample_size, batch_size, epochs, n_batches.
- Returns
 train_gen (WeightedGenerator) – WeightedGenerator instance used for training batches
test_gen (WeightedGenerator) – WeightedGenerator instance used for evaluation batches
- static get_most_common_ngrams(vocab, n_min=None, n_max=None)
 Get most common of each ngram
- Parameters
 vocab (list) – List of ngrams which have been built previously
n_min (int | None) – Minimum size of ngram used when building vocab
n_max (int | None) – Maximum size of ngram used when building vocab
- static get_ngrams(text, n_min, n_max)
 Create char n grams from text
- Parameters
 text (str) – Text string for which to compute ngrams
n_min (int | None) – Minimum size of ngram used when building vocab
n_max (int | None) – Maximum size of ngram used when building vocab
- Returns
 grams – Dictionary of grams with values as gram count
- Return type
 dict
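Character n-gram counting can be sketched with a sliding window over the text. The function below is a standalone illustration, not the library's implementation: `char_ngrams` is a hypothetical name, and it assumes `n_min` and `n_max` are integers and that both bounds are inclusive.

```python
from collections import Counter


def char_ngrams(text, n_min, n_max):
    """Count every character n-gram with n_min <= n <= n_max (hypothetical helper)."""
    counts = Counter()
    for n in range(n_min, n_max + 1):
        # Slide a window of width n across the text.
        for start in range(len(text) - n + 1):
            counts[text[start:start + n]] += 1
    return dict(counts)


grams = char_ngrams("abab", 1, 2)
```

For the input `"abab"` this counts the 1-grams `a` and `b` twice each, `ab` twice, and `ba` once.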
- get_vocab_difference(texts)
 Get difference in direct vocab and adapted vocab
- Parameters
 texts (list | ndarray | pd.DataFrame) – Set of texts used to build vocabulary
- classmethod get_vocab_direct(texts, n_min=None, n_max=None, chunk_words=False)
 Get vocab direct from texts
- Parameters
 texts (list | ndarray | pd.DataFrame) – List of texts for which to compute vocab
n_min (int | None) – Minimum size of ngram used when building vocab
n_max (int | None) – Maximum size of ngram used when building vocab
chunk_words (bool) – Whether to compute ngrams on individual words
- Returns
 List of words or ngrams
- Return type
 list
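The sketch below shows one way the parameters above could interact. The semantics are assumptions made for illustration, not confirmed behaviour: without n-gram bounds the vocabulary is the set of whitespace-separated words, with bounds it is the set of character n-grams, and `chunk_words=True` restricts n-grams to individual words. `direct_vocab` is a hypothetical name.

```python
def direct_vocab(texts, n_min=None, n_max=None, chunk_words=False):
    """Collect a sorted vocabulary of words or character n-grams (hypothetical helper)."""
    use_ngrams = n_min is not None and n_max is not None
    vocab = set()
    for text in texts:
        # Work word by word unless n-grams should span the whole text.
        units = text.split() if (chunk_words or not use_ngrams) else [text]
        for unit in units:
            if not use_ngrams:
                vocab.add(unit)
                continue
            for n in range(n_min, n_max + 1):
                for start in range(len(unit) - n + 1):
                    vocab.add(unit[start:start + n])
    return sorted(vocab)


words = direct_vocab(["ab cd"])
bigrams = direct_vocab(["ab cd"], n_min=2, n_max=2)
word_bigrams = direct_vocab(["ab cd"], n_min=2, n_max=2, chunk_words=True)
```

With `chunk_words=True` no n-gram crosses a word boundary, so n-grams containing spaces never enter the vocabulary.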
- init_bert(**kwargs)
 Initialize BERT
- Parameters
 kwargs (dict) – Dictionary of config parameters
- Returns
 bert_preprocess (hub.KerasLayer) – BERT tokenizer to be used in network
bert_encoder (hub.KerasLayer) – BERT encoder to be used in network
- classmethod load(inpath, **kwargs)
 Load model
- Parameters
 inpath (str) – Path from which to load model
- Returns
 Initialized NNmodel model
- classmethod load_data(data_file)
 Load data from csv file
- Parameters
 data_file (str) – Path to csv file storing texts and labels
- Returns
 df – Pandas dataframe of texts and labels
- Return type
 pd.DataFrame
- model_test()
 Test model on some key phrases
- Parameters
 model (ModerationModel) –
- predict(X, verbose=False)
 Predict classification
- Parameters
 X (ndarray | list | pd.DataFrame) – Set of texts to classify
verbose (bool) – Whether to show progress bar for predictions
- Returns
 List of predicted classifications for input texts
- Return type
 list
- predict_one(X, verbose=False)
 Predict probability of label=1
- Parameters
 X (ndarray | list | pd.DataFrame) – Set of texts to classify
verbose (bool) – Whether to show progress bar for predictions
- Returns
 List of predicted probability for label=1 for input texts
- Return type
 list
- predict_proba(X, verbose=False)
 Predict probability
- Parameters
 X (ndarray | list | pd.DataFrame) – Set of texts to classify
verbose (bool) – Whether to show progress bar for predictions
- Returns
 List of probabilities of having label=1 for input texts
- Return type
 list
- predict_zero(X, verbose=False)
 Predict probability of label=0
- Parameters
 X (ndarray | list | pd.DataFrame) – Set of texts to classify
verbose (bool) – Whether to show progress bar for predictions
- Returns
 List of predicted probability for label=0 for input texts
- Return type
 list
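In a two-class setting the outputs of the prediction methods above are typically related as sketched below. This assumes the two probabilities sum to one and that classification uses a 0.5 threshold; neither is confirmed by this page, and both function names are hypothetical.

```python
def label_zero_probs(p_one):
    """Probability of label=0, assuming p(0) = 1 - p(1)."""
    return [1.0 - p for p in p_one]


def classify(p_one, threshold=0.5):
    """Map label=1 probabilities to 0/1 classes (threshold is an assumption)."""
    return [1 if p >= threshold else 0 for p in p_one]


p_one = [0.75, 0.25, 0.5]
p_zero = label_zero_probs(p_one)
labels = classify(p_one)
```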
- static print_eval(accr)
 Log evaluation
- Parameters
 accr (list) – List of evaluation results
- classmethod run(data_file, config)
 Run model pipeline. Load data, tokenize texts, and train
- Parameters
 data_file (str) – Path to csv file storing texts and labels
config (RunConfig) – Config class with kwargs
- Returns
 Trained and evaluated keras model or svm
- Return type
 
- save(outpath)
 Save model
- Parameters
 outpath (str) – Path to save model
- static save_params(outpath, kwargs)
 Save params to model path
- Parameters
 outpath (str) – Path to model
kwargs (dict) – Dictionary of kwargs used to build model
- score(X, Y)
 Score model against targets
- Parameters
 X (pd.DataFrame) – Pandas dataframe of texts
Y (pd.DataFrame) – Pandas dataframe of labels for the corresponding texts
- Returns
 Accuracy value: the fraction of predictions that match the ground truth labels
- Return type
 float
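Accuracy is the number of correct predictions divided by the total number of samples. A minimal standalone sketch, using plain lists rather than dataframes and a hypothetical function name:

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions equal to the corresponding true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot score an empty set of labels")
    correct = sum(1 for truth, pred in zip(y_true, y_pred) if truth == pred)
    return correct / len(y_true)


acc = accuracy([1, 0, 1, 1], [1, 0, 0, 1])
```

Here three of the four predictions match, so `acc` is 0.75.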
- classmethod split_data(df, test_split=0.1)
 Split data into training and test sets
- Parameters
 df (pd.DataFrame) – Pandas dataframe of texts and labels
test_split (float) – Fraction of full dataset to use for test data
- Returns
 df_train (pd.DataFrame) – Pandas dataframe of texts and labels for training
df_test (pd.DataFrame) – Pandas dataframe of texts and labels for testing
- static split_words(text)
 Split text into words
- Parameters
 text (str) – Text string for which to compute ngrams
- Returns
 grams – Dictionary of grams with values as gram count
- Return type
 dict
- static standardize_grams(grams)
 Standardize vocab. Lowercase and remove punctuation
- Parameters
 grams (dict) – Dictionary of grams with values as gram count
- Returns
 clean_grams – Dictionary of standardized grams with values as gram count
- Return type
 dict
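Lowercasing and stripping punctuation can make several raw grams collapse into the same key, so their counts need to be merged. The sketch below illustrates this under two assumptions: punctuation means `string.punctuation`, and grams that become empty are dropped. `standardize` is a hypothetical name, not the library's code.

```python
import string
from collections import Counter

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def standardize(grams):
    """Lowercase grams, remove punctuation, and merge counts of colliding keys."""
    clean = Counter()
    for gram, count in grams.items():
        key = gram.lower().translate(_PUNCT_TABLE)
        if key:  # skip grams that consisted only of punctuation
            clean[key] += count
    return dict(clean)


clean_grams = standardize({"Hi!": 2, "hi": 1, "?!": 4})
```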
- train(train_gen, test_gen, **kwargs)
 Train model and evaluate
- Parameters
 train_gen (WeightedGenerator) – WeightedGenerator instance used for training batches
test_gen (WeightedGenerator) – WeightedGenerator instance used for evaluation batches
kwargs (dict) – Dictionary with optional keyword parameters. Can include sample_size, batch_size, epochs, n_batches.
- transform(X)
 Transform texts
- Parameters
 X (list | ndarray | pd.DataFrame) – Set of texts to transform before sending to model
- Returns
 X – Set of transformed texts ready to send to model
- Return type
 list | ndarray | pd.DataFrame