After reading this article you’ll be able to:
Machine Learning Engineers and Data Scientists already have a handful of tools…
Imagine you’ve been tasked with building a Sentiment Analysis tool for your company’s product reviews. As a seasoned Data Scientist, you’ve built plenty of insights about future sales predictions and were even able to classify customers based on their purchasing behavior.
But now you’re intrigued: you have this pile of text entries and have to turn them into features for a Machine Learning model. How can that be done? That’s a common question when Data Scientists meet text for the first time.
As simple as it may look to experienced NLP Data Scientists, turning text into features is…
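To make the question concrete: one of the simplest answers is a bag-of-words representation. Here is a minimal sketch using scikit-learn’s CountVectorizer, with made-up reviews — my illustration, not necessarily the approach the article takes:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Reviews invented for illustration only.
reviews = [
    "Great product, works as advertised",
    "Terrible quality, broke after one day",
    "Works great, would buy again",
]

vectorizer = CountVectorizer()                 # tokenizes text and builds a vocabulary
features = vectorizer.fit_transform(reviews)   # sparse document-term count matrix

print(vectorizer.get_feature_names_out())      # the learned vocabulary
print(features.toarray())                      # one count vector per review
```

Each review becomes a row of word counts — a numeric feature vector a model can actually consume.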
In the last few articles we spent some time explaining and implementing some of the most important preprocessing techniques in NLP. However, we’ve played too little with real text so far. Now it’s time to work on that.
We talked about Text Normalization in the article about stemming. However, stemming is neither the most important nor the most widely used task in Text Normalization. We also went over some other normalization techniques earlier, such as Tokenization, Sentencizing and Lemmatization. …
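As a quick taste of how those techniques chain together, here is a tiny sketch of a normalization pipeline using NLTK (assuming the punkt and wordnet resources are downloaded; the sample sentence is my own):

```python
# Sentencize -> tokenize -> lemmatize, using NLTK's off-the-shelf tools.
import nltk
from nltk.stem import WordNetLemmatizer

text = "The cats were running. They ran faster than the dogs."
lemmatizer = WordNetLemmatizer()

for sentence in nltk.sent_tokenize(text):   # sentencizing
    tokens = nltk.word_tokenize(sentence)   # tokenization
    # Naively lemmatize every token as a verb; a real pipeline would
    # choose the PoS per token (more on that in the PoS tagging article).
    print([lemmatizer.lemmatize(t.lower(), pos="v") for t in tokens])
```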
If you’re into NLP, you’ve probably stumbled upon a dozen tools that have this neat feature named “lemmatization”. In this article, I’ll do my best to guide you through what Lemmatization is, why it’s useful and how we can build a Lemmatizer!
If you’re coming from my previous article on how to build a PoS Tagger, you’ve already grasped the important prerequisites for Lemmatization. If not, I’ll gently introduce them over the course of this article, so let’s get started!
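To see why PoS tags are a prerequisite, here is a small sketch with NLTK’s WordNetLemmatizer (assuming the wordnet resource is downloaded; the words are my examples): the same surface form lemmatizes differently depending on its part of speech.

```python
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

# The same word lemmatizes differently depending on the PoS tag:
print(lemmatizer.lemmatize("meeting", pos="n"))  # 'meeting' (the noun)
print(lemmatizer.lemmatize("meeting", pos="v"))  # 'meet' (the verb)
print(lemmatizer.lemmatize("better", pos="a"))   # 'good'
```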
Time to dive a little deeper into grammar.
In this article, following the series on NLP, we’ll understand and create a Part of Speech (PoS) Tagger. The idea is to be able to extract “hidden” information from our text and also to enable the future use of Lemmatization, a text normalization tool that depends on PoS tags to work correctly.
In this article, we’ll touch on some more advanced topics, such as Machine Learning algorithms and a bit of grammar and syntax. …
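As a concrete preview of what a tagger produces, here is a sketch using NLTK’s ready-made tagger (not the one we’ll build; assumes the punkt and tagger resources are downloaded):

```python
import nltk  # assumes 'punkt' and the perceptron tagger resources are downloaded

tokens = nltk.word_tokenize("The quick brown fox jumps over the lazy dog")
print(nltk.pos_tag(tokens))  # Penn Treebank tags
# e.g. [('The', 'DT'), ('quick', 'JJ'), ('brown', 'JJ'), ('fox', 'NN'), ...]
```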
It is time to talk about stems.
Stems are the main body or stalk of a plant or shrub, typically rising above ground but occasionally subterranean. Well, that’s what Google says, and it is right!
But here we’re going to talk about Word Stems. If you’re coming from my previous articles, you know that this is an optional step in the NLP Preprocessing Pipeline.
In this article we’ll implement the Porter Stemmer, probably the most famous stemming algorithm out there, published by Martin Porter in 1980 (yes, it is old!). …
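Before we implement it ourselves, here is what the end result should look like, sketched with NLTK’s ready-made PorterStemmer (the word list is mine, mostly borrowed from Porter’s classic examples):

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["connection", "connected", "connecting", "ponies", "caresses"]:
    print(word, "->", stemmer.stem(word))
# connection/connected/connecting all reduce to 'connect';
# ponies -> 'poni', caresses -> 'caress'
```

Note that a stem like “poni” need not be a real word — stemming only chops suffixes, which is exactly what distinguishes it from lemmatization.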
So we’ve learned about the many distinct steps that a Preprocessing Pipeline can take. If you’re coming from my previous article (or an NLP class), you probably have a general idea of what Tokenization is and what it’s for. The purpose of this article is to clarify what Tokenization is, how it works and, most importantly, to implement a Tokenizer.
If you came from my previous article, you might also be wondering what happened to the “Bare String Preprocessing” step. …
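As a taste of where we’re headed, a naive tokenizer can be as small as one regular expression (a sketch of my own; the article builds a more careful version):

```python
import re

def naive_tokenize(text):
    # Grab runs of word characters, or any single non-space symbol.
    return re.findall(r"\w+|[^\w\s]", text)

print(naive_tokenize("Don't panic! It's just tokenization."))
# ['Don', "'", 't', 'panic', '!', 'It', "'", 's', 'just', 'tokenization', '.']
```

Notice how “Don’t” gets split into three pieces — handling cases like contractions is exactly why a proper Tokenizer is worth building.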
This article is part of a series that aims to clarify the most important details of NLP. You can refer to the main article here.
After a bit of story, we get to see when and why to apply NLP. Along this track, there’s an important concept called “preprocessing”, one that is common to every area of Data Science (you want your data neat and clean, right?).
But while with numerical data you’ll usually apply some normalization rules (rescaling values between the min and the max), drop or fill NaNs (that is, empty values) and detect outliers (points out of…
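For contrast, here is what that usual numerical preprocessing looks like in pandas, as a hedged sketch with made-up values:

```python
import pandas as pd

# A made-up numerical column for illustration.
df = pd.DataFrame({"price": [10.0, 250.0, None, 40.0, 35.0]})

df["price"] = df["price"].fillna(df["price"].median())   # fill NaNs
df["price_norm"] = (df["price"] - df["price"].min()) / (
    df["price"].max() - df["price"].min()
)                                                        # min-max normalization
# Crude z-score flag for outliers (1.5 standard deviations from the mean).
df["outlier"] = (df["price"] - df["price"].mean()).abs() > 1.5 * df["price"].std()
print(df)
```

None of these recipes translate directly to text — which is the whole point of the articles that follow.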
Natural Language Processing (NLP) is probably one of the most turbulent fields under the Computer Science umbrella today.
While it is not something new, technological advances, new algorithms and an abundance of data have made it almost mundane to get computers to read/write and listen/speak (not to mention the attempts to make computers really understand what is written, which is the business of Natural Language Understanding, NLU).
This story is the starting point for a series that proposes to present the field of Natural Language Processing without being either too much of a math-tech-bot or too much of a language-theory-worm. …
So you’re into Natural Language Processing, or NLP (not to be confused with Neuro Linguistic Programming, whatever that is; it keeps appearing in my search results…).
It’s 1950, in the journal Mind. A prominent scientist (a mathematician) named Alan Turing, after a long discussion of theoretical ways to make a machine learn, wrote the following words:
it is best to provide the machine with the best sense organs that money can buy, and then teach it to understand and speak English.[1]
And there NLP was born. Okay, not quite so abruptly, but the idea was there. …
A Data Scientist passionate about data and text. Trying to understand and clearly explain all the important nuances of Natural Language Processing.