PyTorch and TensorFlow in the Natural Language Processing Pipeline: Data Preprocessing

Slinae
4 min read · Apr 14, 2022

The first part is about preprocessing text data in NLP.

1. When doing natural language processing, the first thing we need to do is read the documents’ data from a file and clean it (for example, stripping unnecessary spaces, substituting characters, or normalizing punctuation with basic Python code). This step is almost the same for PyTorch and TensorFlow.

2. The second thing to do is to tokenize the sentences (for English we can use the split function, but for Chinese we need a specific library such as “Jieba”, https://github.com/fxsjy/jieba), build a vocabulary that maps words to numbers and vice versa, and collect the word frequencies. A minimal sketch of these two steps is shown right after this list.
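The snippet below is a rough illustration of steps 1 and 2 using only basic Python: it cleans a couple of toy sentences, tokenizes them with split, and builds a word-to-index mapping plus word frequencies. The sentences and the cleaning rules here are made up for illustration; for Chinese text, the split call could be replaced by something like jieba.lcut.

import re
from collections import Counter

# hypothetical raw documents (in practice they would be read from a file)
raw_docs = [
    "  Hello,   world!! This is   the first document. ",
    "This is the second   document , with messy   spacing.",
]

def clean(text):
    text = text.strip().lower()              # strip surrounding space, lowercase
    text = re.sub(r"\s+", " ", text)         # collapse repeated whitespace
    text = re.sub(r"[^a-z0-9\s]", "", text)  # one possible punctuation normalization
    return text

# tokenize: split() for English; for Chinese, jieba.lcut(text) could be used instead
tokenized_docs = [clean(doc).split() for doc in raw_docs]

# word frequencies and a word <-> index mapping
freqs = Counter(token for doc in tokenized_docs for token in doc)
idx_to_word = ["<unk>"] + sorted(freqs)
word_to_idx = {word: idx for idx, word in enumerate(idx_to_word)}

print(freqs.most_common(3))
print([word_to_idx[token] for token in tokenized_docs[0]])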

PyTorch

If we use the modules the library provides (torchtext.vocab), we can consult the PyTorch tutorial for preprocessing here:

https://pytorch.org/tutorials/beginner/text_sentiment_ngrams_tutorial.html

import torch
from torchtext.datasets import AG_NEWS
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

# a basic English tokenizer and the AG_NEWS training split
tokenizer = get_tokenizer('basic_english')
train_iter = AG_NEWS(split='train')

# yield the token list of every example so the vocab can be built in one pass
def yield_tokens(data_iter):
    for _, text in data_iter:
        yield tokenizer(text)

vocab = build_vocab_from_iterator(yield_tokens(train_iter), specials=["<unk>"])
vocab.set_default_index(vocab["<unk>"])  # unknown words map to <unk>

# text -> list of token ids, label -> zero-based integer
text_pipeline = lambda x: vocab(tokenizer(x))
label_pipeline = lambda x: int(x) - 1
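
To sanity-check the two pipelines, we can run them on a throwaway sentence and label; the exact ids depend on the vocabulary built above.

print(text_pipeline('here is an example'))  # a list of token ids, one per token
print(label_pipeline('3'))                  # 2, since AG_NEWS labels start at 1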

But if we need to build the vocabulary ourselves in PyTorch, we can do it in the following steps.

We need to write a Vocabulary class that has the following parts:

1. Initialization: create a list (storing index_to_token) and a dictionary (storing token_to_index), and add the <unk> token to both.

from collections import defaultdict

class Vocab:
    def __init__(self, tokens=None):
        self.idx_to_token = list()
        self.token_to_idx = dict()
        if tokens is not None:
            if "<unk>" not in tokens:
                tokens = tokens + ["<unk>"]
            for token in tokens:
                self.idx_to_token.append(token)
                self.token_to_idx[token] = len(self.idx_to_token) - 1
            self.unk = self.token_to_idx['<unk>']

2. Build the vocabulary from the text (by looping over the sentences):

    @classmethod
    def build(cls, text, min_freq=1, reserved_tokens=None):
        token_freqs = defaultdict(int)
        for sentence in text:
            for token in sentence:
                token_freqs[token] += 1
        uniq_tokens = ["<unk>"] + (reserved_tokens if reserved_tokens else [])
        uniq_tokens += [token for token, freq in token_freqs.items()
                        if freq >= min_freq and token != "<unk>"]
        return cls(uniq_tokens)

3. Convert tokens to ids and ids to tokens, and implement the __len__ and __getitem__ methods that a PyTorch dataset requires:

    def __len__(self):
        return len(self.idx_to_token)

    def __getitem__(self, token):
        return self.token_to_idx.get(token, self.unk)

    def convert_tokens_to_ids(self, tokens):
        return [self[token] for token in tokens]

    def convert_ids_to_tokens(self, indices):
        return [self.idx_to_token[index] for index in indices]

4. Functions to save the vocab to a file and load it back:

def save_vocab(vocab, path):
    with open(path, 'w') as writer:
        writer.write("\n".join(vocab.idx_to_token))

def read_vocab(path):
    with open(path, 'r') as f:
        tokens = f.read().split('\n')
    return Vocab(tokens)
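
Putting the pieces together, here is a quick usage sketch of the class above; the toy corpus and the file name are made up for illustration.

corpus = [["i", "love", "nlp"], ["i", "love", "pytorch"]]
vocab = Vocab.build(corpus, min_freq=1)
print(len(vocab))                                         # 5 tokens, including <unk>
print(vocab.convert_tokens_to_ids(["i", "love", "jax"]))  # unknown words map to the <unk> id
save_vocab(vocab, "vocab.txt")
vocab = read_vocab("vocab.txt")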

We also need to customize a dataset class (inheriting from PyTorch’s Dataset class) and build a dataloader (from the DataLoader class, together with a collate function) that fit the PyTorch framework.

1. A simple example of customizing a dataset. Sometimes customizing the dataset is more complex than the example here; it depends on the dataset and the task.

import torch
from torch import nn, optim
from torch.nn import functional as F
from torch.utils.data import Dataset, DataLoader
from torch.nn.utils.rnn import pad_sequence
from collections import defaultdict
# the Vocab class we wrote above
from vocab import Vocab

class CnnDataset(Dataset):
    def __init__(self, data):
        self.data = data

    def __len__(self):
        return len(self.data)

    def __getitem__(self, i):
        return self.data[i]

2. A simple example of customizing a dataloader:

A necessary function to implement and pass to the dataloader is the collate_fn function. Sometimes the collate_fn function is more complex than the example here; it depends on the dataset and the task.

def collate_fn(examples):
    # each example is a (token_ids, label) pair
    inputs = [torch.tensor(ex[0]) for ex in examples]
    targets = torch.tensor([ex[1] for ex in examples],
                           dtype=torch.long)
    # pad every sequence in the batch to the length of the longest one
    inputs = pad_sequence(inputs, batch_first=True)
    return inputs, targets

train_dataset = CnnDataset(train_data)
train_data_loader = DataLoader(train_dataset,
                               batch_size=batch_size,
                               collate_fn=collate_fn,
                               shuffle=True)
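
For completeness, a small end-to-end sketch of how these pieces fit together; the train_data list, its ids, and batch_size below are hypothetical and assume each example is a (token_ids, label) pair.

train_data = [([2, 5, 9], 0), ([7, 3], 1), ([4, 4, 8, 6], 0)]  # hypothetical ids and labels
batch_size = 2
train_dataset = CnnDataset(train_data)
train_data_loader = DataLoader(train_dataset, batch_size=batch_size,
                               collate_fn=collate_fn, shuffle=True)
for inputs, targets in train_data_loader:
    print(inputs.shape, targets.shape)  # inputs are padded to the longest sequence in the batch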

The complete code example for preprocessing text data in PyTorch can be referred to here. Many thanks to the code authors.

TensorFlow:

We can consult the TensorFlow documentation for preprocessing here:

Use the Tokenizer from tensorflow.keras.preprocessing.text.

Use the fit_on_texts method of the tokenizer to build the vocab, and use texts_to_sequences of the tokenizer to get the token sequences (word index sequences).

Use pad_sequences from tensorflow.keras.preprocessing.sequence to pad the sequences to a uniform length.

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

sentences = [
    'I love reading book and writing essay',
    'I plan to buy a new house',
    'You like to walk in the garden!',
    'Do you think it is interesting to play the game?'
]

# initialize the tokenizer object; the OOV token is defined in the arguments
tokenizer = Tokenizer(num_words=500, oov_token="<OOV>")
# build a vocab
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
sequences = tokenizer.texts_to_sequences(sentences)
# pad every sequence to a uniform length of 15
padded = pad_sequences(sequences, maxlen=15)
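
A quick way to inspect the result (the exact indices depend on word frequency in the toy sentences above):

print(word_index)    # word -> index mapping, with '<OOV>' reserved for out-of-vocabulary words
print(sequences[0])  # the first sentence as a list of word indices
print(padded.shape)  # (4, 15): four sentences, each padded to length 15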

The complete code example for preprocessing text data in TensorFlow can be referred to here. Many thanks to the code authors.

https://github.com/https-deeplearning-ai/tensorflow-1-public/tree/main/C3/W1/ungraded_labs
