

A dive into Transformers for language translation


Transformers · Language Translation · Attention Mechanism · PyTorch


The last time I did something serious in natural language processing was during my days at RKMVERI, back in 2020, while selecting a project for my ML-1 course. At that time, I thought language generation was a cool thing, and people were using all sorts of RNNs, LSTMs, bidirectional LSTMs, and whatnot to generate or summarize text, and even for sentence-completion tasks. I didn't have a fascination for NLP tasks, since the ideas were not natural to me; had I taken a linguistics course, I might have taken a keen interest in them. The only interesting idea that came to my mind was story generation. I didn't have the requisite knowledge to turn that idea into reality, since I was just starting to learn ML at the time, but I was fascinated by the way the human mind compresses data and spits out interesting, sometimes randomly connected stories during REM sleep. I thought that if I took some 4-5 keywords, generated nearby tokens to connect them, and made 3-4 passes to interpolate more interesting ideas in between, I could perhaps come up with a story-generation mechanism. When I discussed this with my ML-1 mentor, he said that many people at Ivy League schools were already trying to do it, so he didn't recommend investing time in something that would outlast the course project, since I could only afford about 3-4 months for it. So I worked on a simpler task: song generation. After 2-3 days of collecting some Keras code from Kaggle and piecing together a pipeline, and after repeated failed attempts, I was able to train a bidirectional LSTM to complete a line of a song and generate about 1024 more characters. The moment I finished training and gave it some randomly rhyming words to complete, it generated completely unrelated rhyming garbage, with a lot of slang. I still remember giving it a prompt to complete a song.
I started with: “I was a little boy” and the output it generated was: “I was a little boy, she was a little bitch ...” followed by those slurs repeated again and again. It was doing something I never expected it to do, generating curse-word songs, and I knew I wouldn't be able to present this to my mentor, so I gave up on the project and started working on image segmentation in the medical domain. Later I found out that the dataset I had used for training contained a lot of curse words, and the model had simply learned to mimic its dataset. I was a bit disappointed, but I figured that this is how things work in ML, and that I should have been more careful when selecting the dataset. By then it was too late, though, and I had already invested a lot of time in medical image segmentation.

That was the last time I touched NLP code seriously, and now I am thinking of catching up on all the things I ditched back then. After the arrival of LLMs for chatting (like ChatGPT), and after hearing Geoffrey Hinton's and Ilya Sutskever's discussions, I thought LLMs might think the way humans think, that they might be a bit conscious. But after working with transformers for a while, I think that is complete nonsense. The very idea that something like language might someday wake up consciousness in a machine feels absurd to me. These models we call LLMs are essentially ogres: they eat and compress a lot of data into their latent manifold, then spit out interpolated, sometimes weird, unrelated mispredictions (called hallucinations these days) in response to a query. This works for most of the usual, mundane tasks at hand, but I don't think you can have something like mind, consciousness, or reasoning in machines. They might simulate it, but not like humans, who actually reason to find solutions to complex problems. In fact, I don't even think that reasoning or consciousness can be captured in a mathematical representation; they are simply not things that can be done algorithmically (an algorithm being, essentially, a mathematical construct). I now completely disagree with the idea I entertained in my previous blog post, but this, again, can be debated endlessly.

So, with all this background yapping out of the way, let's dive into the PyTorch code, where we talk about the Transformer paper, implement code for translation, and break down some essential components.