Python Tutorial: Advanced tokenization with NLTK and regex

20 Sep 2021

Want to learn more? Take the full course at https://learn.datacamp.com/courses/introduction-to-natural-language-processing-in-python at your own pace. More than a video, you'll learn hands-on coding and quickly apply skills to your daily work.

---

In this video, we'll take a look at doing more advanced tokenization with regex. One new regex pattern you will find useful for advanced tokenization is the OR operator. In regex, OR is represented by the pipe character (|). To use OR, you can define a group using parentheses. Groups can be either a pattern or a set of characters you want to match. You can also define explicit character classes using square brackets. We'll go into more depth on groups and ranges soon.

Let's take an example: we want to tokenize a string using regular expressions, finding all digits and words. We define our pattern using a group with the OR symbol and make both alternatives greedy so they catch the full word or run of digits. Then we call findall from Python's re library to return our tokens. Notice that our pattern does not match punctuation but properly matches the words and digits.

Let's take a look at another, more advanced topic: defining groups and character ranges. Here we have another chart of patterns, and this time we are using ranges, or character classes, marked by square brackets, and groups, marked by parentheses. We can see in this chart that square brackets define a new character class. For example, we can match all uppercase and lowercase English letters using uppercase A, hyphen, uppercase Z (A-Z), which matches all uppercase letters, and then lowercase a, hyphen, lowercase z (a-z), which matches all lowercase letters. We can also make ranges to match all digits (0-9), or perhaps a more complex range like uppercase and lowercase English letters with the hyphen and period. Because the hyphen and period are special characters in regex, we must tell regex we mean an ACTUAL period or hyphen. To do so, we use what is called an escape character, and in regex that means placing a backslash in front of the character so the engine knows to look for a literal hyphen or period.

On the other hand, with groups, which are designated by parentheses, we can only match what we explicitly define in the group. So (a-z) matches only an a, a hyphen, and a z, in that order. Groups are useful when you want to define an explicit group, such as the final example, where we are matching spaces or commas.

In this code example, we use match with a character range to match all lowercase ASCII letters, any digits, and spaces. It is greedy, marked by the + after the range definition, but once it hits the comma it can't match anymore. This short example demonstrates that thinking about which regex method you use, such as search versus match, and whether you define a group or a range can have a large impact on the usefulness and readability of your patterns.

Now it's your turn to practice advanced regex techniques to help with tokenization.
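As a starting point for that practice, here is a minimal sketch of the patterns described in the video, using only Python's re library. The sample strings and variable names are illustrative assumptions and are not taken from the video's slides. The patterns themselves are one plausible reading of the spoken descriptions.

```python
import re

# Sample strings are assumptions chosen for illustration.
sentence = "He has 11 cats."
text = "match lowercase spaces nums like 12, but no commas"

# OR inside a group: digits or word characters, each made greedy with +.
# Punctuation matches neither alternative, so it is left out of the tokens.
tokens = re.findall(r"(\d+|\w+)", sentence)
print(tokens)  # ['He', 'has', '11', 'cats']

# Character class: [a-z] matches any single lowercase letter.
class_hits = re.findall(r"[a-z]+", "a-z or abc")
print(class_hits)  # ['a', 'z', 'or', 'abc']

# Group: (a-z) matches only the literal sequence a, hyphen, z.
group_hits = re.findall(r"(a-z)", "a-z or abc")
print(group_hits)  # ['a-z']

# Escaped hyphen and period inside a class, so both are taken literally.
domain_hits = re.findall(r"[A-Za-z\-\.]+", "My-Website.com is up")
print(domain_hits)  # ['My-Website.com', 'is', 'up']

# Group with OR: runs of whitespace or a single comma.
separators = re.findall(r"(\s+|,)", "one, two")
print(separators)  # [',', ' ']

# match anchors at the start of the string; the greedy range stops
# at the comma because a comma is not in the character class.
prefix = re.match(r"[a-z0-9 ]+", text)
print(prefix.group())  # match lowercase spaces nums like 12

# match versus search: match fails unless the pattern fits at position 0,
# while search scans forward for the first place the pattern fits.
no_hit = re.match(r"\d+", sentence)
first_number = re.search(r"\d+", sentence)
print(no_hit, first_number.group())  # None 11
```

Comparing class_hits with group_hits shows the difference between a range and a group on the same input, and the last two calls show why the choice between match and search matters.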