R Tutorial: Tokenization

20 Sep 2021

Want to learn more? Take the full course at https://learn.datacamp.com/courses/introduction-to-natural-language-processing-in-r at your own pace. More than a video, you'll learn hands-on coding and quickly apply skills to your daily work.

---

Now that we have looked at a basic way to search text, let's move on to a fundamental component of text preprocessing: tokenization. Tokenization is the act of splitting text into individual tokens. Tokens can be as small as individual characters or as large as the entire text document. The most common types of tokens are characters, words, sentences, and documents. You can even separate text into tokens based on a regular expression, for example splitting text every time you see a number with three or more digits.

R has an abundance of ways to tokenize text, but we will use the tidytext package, which describes itself as "Text Mining using 'dplyr', 'ggplot2', and Other Tidy Tools". The tidytext package follows the tidy data format. Taking the Introduction to the Tidyverse course may be helpful if you are new to the tidy concepts.

Throughout this course we are going to use a couple of different datasets, the first being the 10 chapters of the book Animal Farm. This is a great dataset for our course. Although our data is limited to just the text and the chapter number, it has a rich character list, themes that repeat themselves, and simple vocabulary for us to explore.

The tidytext function for tokenization is called unnest_tokens(). This function takes our input tibble, called animal_farm, and extracts tokens from the column specified by the input argument. We also specify what kind of tokens we want and what the output column should be labeled. Our tokenization options include sentences, lines, regex for a user-specified regular expression, and many others.

We can take this a step further and count the top tokens by simply adding the count() function to the end of our code. Not the most interesting output yet, but we will clean this up later. The most common words are just common English words such as "the", "and", "of", and "to".

Another use of unnest_tokens() is to find all mentions of a particular word and see what follows it. In Animal Farm, Boxer is one of the main characters. Let's see what chapter one says about him. Here we have filtered animal_farm to chapter 1 and looked for any mention of Boxer, regardless of whether "Boxer" is capitalized. Since the first token starts at the beginning of the text, I am using the slice() function to skip the first token. The output is the text that follows every mention of Boxer, who apparently was an enormous beast at nearly eighteen hands high.

Tokenizing text is a vital component of several text analysis tasks. Let's practice with a few examples.
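
The code shown in the video is not part of this transcript, so the following is only a minimal sketch of the three steps it describes: tokenizing into words, counting the top tokens, and splitting on a case-insensitive regular expression to see what follows each mention of Boxer. The course's animal_farm tibble is not available here, so the sketch uses a small made-up stand-in called toy_farm. The column names chapter and text_column, the value "Chapter 1", and the sentences themselves are assumptions for illustration, not the course data.

```r
library(dplyr)
library(tidytext)

# Hypothetical stand-in for the course's animal_farm tibble:
# one row per chapter, with made-up text (not the book's wording).
toy_farm <- tibble::tibble(
  chapter = c("Chapter 1", "Chapter 2"),
  text_column = c(
    paste(
      "The horses came in together.",
      "Boxer was an enormous beast, nearly eighteen hands high.",
      "The animals respected BOXER for his steady work."
    ),
    "The animals worked through the summer and the harvest."
  )
)

# Step 1: split the text column into one lowercase word per row.
words <- toy_farm %>%
  unnest_tokens(output = word, input = text_column, token = "words")

# Step 2: count the tokens, most frequent first.
word_counts <- words %>%
  count(word, sort = TRUE)

# Step 3: split chapter 1 wherever "boxer" appears, ignoring case via (?i).
# The first token is the text before the first mention, so slice() skips it
# and keeps only the text that follows each mention.
after_boxer <- toy_farm %>%
  filter(chapter == "Chapter 1") %>%
  unnest_tokens(
    output = after_mention,
    input = text_column,
    token = "regex",
    pattern = "(?i)boxer"
  ) %>%
  slice(2:n())

print(word_counts)
print(after_boxer)
```

Keeping each chapter in a single row matters for the last step: the regex split is applied per row, so skipping the first token with slice() only makes sense when the filtered data is one block of text.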