Book Classification by ML, Part 2: Feature Engineering by AutoEncoder
This part is about one NLP feature-engineering technique: the autoencoder.
To classify books, we need to turn the book descriptions into features. One feature-engineering technique for sequence data is the encoder-decoder architecture.
Here we implement the autoencoder in PyTorch; minimal sketches of each step follow the list below.
- Build a vocabulary for the text data that maps words to indices and back. It needs to implement the functions word_to_index and index_to_word.
- Process the data into a dataset format that can be fed to PyTorch. It needs to implement the functions __len__, __getitem__, and collate_fn().
- Build an encoder module using a GRU. It needs to implement a forward function with GRU, Linear, and activation layers, and its final hidden state is transferred to the decoder module.
- Build a decoder module using a GRU. It takes as input the encoder input shifted by one time step, together with the encoder's output hidden state. It also needs to implement a forward function with GRU, Linear, and activation layers.
- Connect the encoder and decoder through the hidden state.
- Implement a sequence-mask loss function for the model. Since the sequences are padded to the same length, the loss should not be computed on the padded positions.
- Train the encoder-decoder model. The model's input and target are almost the same, except that the target is one time step later than the input. Teacher forcing is used here.
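Below is a minimal sketch of the vocabulary step. The class name `Vocab`, the `min_freq` parameter, and the special tokens (`<pad>`, `<bos>`, `<eos>`, `<unk>`) are illustrative assumptions, not necessarily what the repo uses.

```python
from collections import Counter

class Vocab:
    def __init__(self, tokenized_texts, min_freq=1):
        counter = Counter(tok for text in tokenized_texts for tok in text)
        # Index 0 is reserved for padding; 1-3 for sequence start/end and unknowns.
        self.itos = ["<pad>", "<bos>", "<eos>", "<unk>"]
        self.itos += [tok for tok, cnt in counter.items() if cnt >= min_freq]
        self.stoi = {tok: i for i, tok in enumerate(self.itos)}

    def word_to_index(self, word):
        # Out-of-vocabulary words map to <unk>.
        return self.stoi.get(word, self.stoi["<unk>"])

    def index_to_word(self, index):
        return self.itos[index]

    def __len__(self):
        return len(self.itos)
```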
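A sketch of the dataset step, reusing the `Vocab` class above; the name `DescriptionDataset` is hypothetical. `collate_fn` pads every sequence in a batch to the length of the longest one, with padding index 0 matching `<pad>`.

```python
import torch
from torch.utils.data import Dataset
from torch.nn.utils.rnn import pad_sequence

class DescriptionDataset(Dataset):
    def __init__(self, tokenized_texts, vocab):
        # Convert each tokenized description into a tensor of word indices.
        self.samples = [
            torch.tensor([vocab.word_to_index(t) for t in text], dtype=torch.long)
            for text in tokenized_texts
        ]

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        return self.samples[idx]

def collate_fn(batch):
    # Pad every sequence in the batch to the longest one; keep real lengths
    # so the loss can ignore the padding later.
    lengths = torch.tensor([len(seq) for seq in batch])
    padded = pad_sequence(batch, batch_first=True, padding_value=0)
    return padded, lengths
```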
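A sketch of the encoder. Placing the Linear layer and tanh activation on the final hidden state is one plausible reading of the "GRU, Linear, and activation layers" description; the repo's exact layout may differ.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, num_layers=1):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.gru = nn.GRU(embed_dim, hidden_dim, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_dim, hidden_dim)

    def forward(self, x):
        # x: (batch, seq_len) -> embedded: (batch, seq_len, embed_dim)
        embedded = self.embedding(x)
        outputs, hidden = self.gru(embedded)
        # Linear + activation applied to the hidden state (an assumption).
        hidden = torch.tanh(self.fc(hidden))
        # hidden: (num_layers, batch, hidden_dim); handed to the decoder.
        return outputs, hidden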
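A sketch of the decoder and of connecting the two modules through the hidden state. The Linear layer projects GRU outputs to vocabulary logits; no explicit softmax is applied because `cross_entropy` in the loss sketch below works on raw logits.

```python
import torch.nn as nn

class Decoder(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, num_layers=1):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.gru = nn.GRU(embed_dim, hidden_dim, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, x, hidden):
        # hidden is the encoder's final hidden state: this is how the
        # two modules are connected.
        embedded = self.embedding(x)
        outputs, hidden = self.gru(embedded, hidden)
        logits = self.fc(outputs)  # (batch, seq_len, vocab_size)
        return logits, hidden

class EncoderDecoder(nn.Module):
    def __init__(self, encoder, decoder):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder

    def forward(self, enc_input, dec_input):
        _, hidden = self.encoder(enc_input)
        logits, _ = self.decoder(dec_input, hidden)
        return logits
```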
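A sketch of the sequence-mask loss: per-token cross entropy is computed everywhere, then zeroed on padded positions and averaged over the real tokens only.

```python
import torch
import torch.nn.functional as F

def sequence_mask_loss(logits, targets, lengths):
    # logits: (batch, seq_len, vocab); targets: (batch, seq_len); lengths: (batch,)
    seq_len = targets.size(1)
    # True at positions inside each sequence's real length, False on padding.
    mask = torch.arange(seq_len, device=targets.device)[None, :] < lengths[:, None]
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1), reduction="none")
    loss = loss.reshape(targets.shape) * mask
    return loss.sum() / mask.sum()
```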
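Finally, a sketch of the training loop with teacher forcing, reusing the `collate_fn` and `sequence_mask_loss` above. The convention here (the decoder reads the sequence without its last token and the target is the same sequence shifted one step later) is an assumption; the hyperparameters are placeholders.

```python
import torch
from torch.utils.data import DataLoader

def train(model, dataset, epochs=10, lr=1e-3, batch_size=32):
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=True,
                        collate_fn=collate_fn)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for epoch in range(epochs):
        total = 0.0
        for padded, lengths in loader:
            # Teacher forcing: the decoder receives the ground-truth
            # sequence rather than its own previous predictions.
            dec_input = padded[:, :-1]
            targets = padded[:, 1:]   # one time step later than the input
            logits = model(padded, dec_input)
            loss = sequence_mask_loss(logits, targets, (lengths - 1).clamp(min=0))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            total += loss.item()
        print(f"epoch {epoch + 1}: loss {total / len(loader):.4f}")
```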
The code details can be checked on GitHub: mcyhx/auto_encoder_for_feature_engineering, an autoencoder for feature engineering for Chinese book classification with PyTorch.
A few notes for the prediction stage (sketched after this list):
- When generating features, don't forget to process the data into the same format as the training data. Alternatively, the data-format processing can be wrapped into the predict function of a customized model class.
- The encoder class definition and vocab class definition must be imported to reconstruct the objects before using the encoder and vocab loaded from where they were saved.
- For sequence data, the batch dimension and the time-step dimension may need to be rearranged; the permute() function is used here.
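A sketch of the prediction stage putting these three notes together; `extract_features` and the file paths are hypothetical. The GRU hidden state has shape (num_layers, batch, hidden_dim), so permute() brings the batch dimension forward before flattening into one feature vector per book.

```python
import torch

# The Encoder and Vocab class definitions must be importable before
# loading the saved objects, e.g. (paths are placeholders):
#   encoder = torch.load("encoder.pt")
#   vocab = torch.load("vocab.pkl")

def extract_features(texts, encoder, vocab, device="cpu"):
    encoder.eval()
    features = []
    with torch.no_grad():
        for text in texts:
            # Index the tokens exactly as at training time.
            ids = torch.tensor([[vocab.word_to_index(t) for t in text]],
                               dtype=torch.long, device=device)
            _, hidden = encoder(ids)  # hidden: (num_layers, 1, hidden_dim)
            # permute() moves the batch dimension to the front, then the
            # layer and hidden dimensions are flattened into one vector.
            features.append(hidden.permute(1, 0, 2).reshape(1, -1))
    return torch.cat(features, dim=0)  # (num_texts, num_layers * hidden_dim)
```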