Python Tutorial: Introduction to tokenization
Want to learn more? Take the full course at https://learn.datacamp.com/courses/introduction-to-natural-language-processing-in-python at your own pace. More than a video, you'll learn hands-on coding and quickly apply skills to your daily work.

---

In this video we'll learn more about string tokenization. Tokenization is the process of transforming a string or document into smaller chunks, which we call tokens. This is usually one step in the process of preparing a text for natural language processing. There are many different theories and rules regarding tokenization, and you can create your own tokenization rules using regular expressions, but normally tokenization will do things like break out words or sentences, often separate punctuation, or tokenize just parts of a string, like separating all hashtags in a tweet.

One library that is commonly used for simple tokenization is nltk, the natural language toolkit library. Using the word_tokenize method, you can break down a string into tokens (see the first sketch after this transcript). We can see from the result that words are separated and punctuation marks are individual tokens as well.

Why bother with tokenization? Because it can help us with some simple text processing tasks like mapping parts of speech, matching common words, and perhaps removing unwanted tokens like common words or repeated words. Here we have a good example. The sentence is "I don't like Sam's shoes." When we tokenize it, we can clearly see the negation in the "n't" token, and we can see possession with the "'s" token. These indicators can help us determine meaning from simple text.

Beyond just tokenizing words, NLTK has plenty of other tokenizers you can use, including these ones you'll be working with in this chapter. The sent_tokenize function will split a document into individual sentences. The regexp_tokenize function uses regular expressions to tokenize the string, giving you more granular control over the process. And the TweetTokenizer does neat things like recognize hashtags, mentions, and runs of punctuation following a sentence. How convenient! (A sketch of these tokenizers also follows below.)

You'll be using more regex in this section as well, not only when you are tokenizing but also when figuring out how to parse tokens and text. The re module's re.match and re.search functions are pretty essential tools for Python string processing. Learning when to use search versus match can be challenging, so let's take a look at how they differ. When we use search and match with the same pattern and string, and the pattern is at the beginning of the string, both find identical matches. That is the case when matching and searching "abcde" with the pattern "abc". When we use search for a pattern that appears later in the string, we get a result, but we don't get the same result using match. This is because match tries to match the string from the beginning and stops as soon as it can no longer match, while search will go through the ENTIRE string to look for match options. If you need to find a pattern that might not be at the beginning of the string, you should use search. If you want to be specific about the composition of the entire string, or at least its initial pattern, then you should use match (see the last sketch below).

Now it's your turn to try some tokenization!
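As a quick recap before the exercises, here is a minimal sketch of the word_tokenize example discussed above. It assumes nltk is installed and the "punkt" tokenizer models have already been downloaded (for example via nltk.download("punkt")).

```python
from nltk.tokenize import word_tokenize

# The example sentence from the video: tokenization separates the
# negation ("n't") and the possessive ("'s") into their own tokens,
# and the final period becomes a token of its own.
tokens = word_tokenize("I don't like Sam's shoes.")
print(tokens)
# ['I', 'do', "n't", 'like', 'Sam', "'s", 'shoes', '.']
```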
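The other tokenizers mentioned above can be sketched the same way. The example sentences and the hashtag pattern here are illustrative assumptions, not part of the course exercises.

```python
from nltk.tokenize import sent_tokenize, regexp_tokenize, TweetTokenizer

# sent_tokenize splits a document into individual sentences.
document = "NLP is fun. Tokenization is one of the first steps!"
print(sent_tokenize(document))
# ['NLP is fun.', 'Tokenization is one of the first steps!']

# regexp_tokenize keeps only what the pattern matches -- here, hashtags.
tweet = "Learning #NLP with #nltk is great!!! @DataCamp"
print(regexp_tokenize(tweet, r"#\w+"))
# ['#NLP', '#nltk']

# TweetTokenizer keeps hashtags and mentions intact as single tokens
# while still separating out the trailing punctuation.
print(TweetTokenizer().tokenize(tweet))
```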
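Finally, a small sketch of the search-versus-match behaviour described above, using the "abcde" example from the video.

```python
import re

# Pattern at the start of the string: match and search return the same result.
print(re.match("abc", "abcde"))   # <re.Match object; span=(0, 3), match='abc'>
print(re.search("abc", "abcde"))  # <re.Match object; span=(0, 3), match='abc'>

# Pattern later in the string: search scans the whole string and finds it,
# but match only anchors at the beginning, so it returns None.
print(re.search("cd", "abcde"))   # <re.Match object; span=(2, 4), match='cd'>
print(re.match("cd", "abcde"))    # None
```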