In this tutorial we are going to talk about regular expressions and how to use them in the Python programming language.
What is RegEx?
The term regular expression is often shortened to RegEx (or regex). In simple words, a regular expression is a sequence of characters that defines a search pattern. We know that a sequence of characters is simply a string, so a string that acts as a pattern for searching in any given text can be termed a regular expression.
What do we mean by a pattern?
It's just a strategy that we use to identify text. So a pattern can mean something like three digits in a row, or two alphabetic letters in a row, or the letters BCA in sequence, or any number of whitespace characters in a row. The applications in the real world are vast.
For example, we may need to parse a big chunk of text and find an email address embedded within it. An email address has a particular pattern.
We have the @ sign in the middle, and then we have something before it and something afterward. Or we can, for example, be looking for a phone number. A phone number has a specific pattern as well. It’s a sequence of numbers. And usually, those numbers are separated by spaces or dashes or slashes or something like that. Or if we’re looking for something like a zip code within the United States, we can write a pattern to search for five digits in a row.
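The three patterns just described could be sketched in Python roughly as follows. The sample text and the pattern strings are assumptions made up for this illustration, and they are deliberately loose: real email addresses and phone numbers need much stricter rules.

```python
import re

# Hypothetical sample text, invented for this illustration.
text = "Write to jane@example.com or call 555-123-4567. ZIP code: 90210."

# Something before the @ sign and something after it (deliberately loose).
email = re.search(r"\S+@\S+", text)

# Three groups of digits separated by dashes or spaces.
phone = re.search(r"\d{3}[- ]\d{3}[- ]\d{4}", text)

# Five digits in a row that are not part of a longer run of digits.
zip_code = re.search(r"\b\d{5}\b", text)

print(email.group())
print(phone.group())
print(zip_code.group())
```

Each call returns a match object here because the sample text contains all three kinds of data; on other text any of them could be None.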
So regular expressions are a small pattern language, available in Python, that allows us to write out those strategies and use them to identify snippets of text within larger chunks of text.
To work with regular expressions, we have to begin by importing a module from the standard library called re. That is short for regular expressions.
We then build a pattern object with re.compile and call its search method on a string. If the pattern does not exist in the string, we're going to get the None object, which represents nothingness. And if the pattern does match in the string that we pass in, we're going to get a different type of object called a match.
Let’s take a look at both of those scenarios. First up, let’s pass in a string like candy.
    import re

    pattern = re.compile("flower")
    print(type(pattern))
    print(pattern.search("candy"))
So, again, Python is going to look for this combination of characters, flower, within the string candy.
Whenever Python cannot find a match for the regular expression pattern, search returns None, and that is what we see here.
Below, I'm going to invoke the search method on my pattern object once again, this time with a string like flower power.
    import re

    pattern = re.compile("flower")
    print(type(pattern))
    print(pattern.search("candy"))

    match = pattern.search("flower power")
    print(type(match))
This time the combination of six characters that we specified does exist at some point in the string, so search returns a match object, which we store in the variable match.
Now, that match object is going to have some helpful methods to help us figure out where the match occurred.
For example, on my match object, I can call a method called group and group is going to return the actual string that’s matched.
    import re

    pattern = re.compile("flower")
    print(type(pattern))
    print(pattern.search("candy"))

    match = pattern.search("flower power")
    print(type(match))
    print(match.group())
So within flower power, the text that the pattern flower identified was flower.
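Besides group, a match object can also report where the match occurred. Here is a minimal sketch using start, end, and span; the sample string "wild flower power" is an assumption chosen so that the match does not begin at index 0.

```python
import re

pattern = re.compile("flower")

# Hypothetical sample string, so the match starts partway through.
match = pattern.search("wild flower power")

print(match.group())  # the text that was matched
print(match.start())  # index where the match begins
print(match.end())    # index just past the last matched character
print(match.span())   # (start, end) as a tuple
```

The end index is exclusive, so slicing the original string with the two span values gives back the matched text.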
Regular Expressions in Python and Their Uses
Metacharacters
Metacharacters are characters that are interpreted in a special way instead of standing for themselves.
Metacharacter  Description                                                      Example
[]             Specifies a set of characters to match.                          "[a-z]"
\              Treats the metacharacter after it as an ordinary character.      "\."
.              Matches any single character except a newline.                   "Ja.v."
^              Matches at the start of the string.                              "^Java"
$              Matches at the end of the string.                                "point$"
*              Matches zero or more occurrences of the item to its left.        "hello*"
+              Matches one or more occurrences of the item to its left.         "hello+"
{}             Matches an exact number of occurrences of the item to its left.  "java{2}"
|              Either/or.                                                       "java|point"
()             Groups patterns together.
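The metacharacters above can be tried out with a short script. This is only a sketch: the sample strings and the names in the checks dictionary are assumptions chosen to exercise one metacharacter at a time.

```python
import re

# Each entry pairs a label with the result of one small experiment.
checks = {
    "set": re.findall(r"[a-c]", "abcdef"),
    "escape": re.findall(r"\.", "a.b.c"),
    # The dot does not match the newline in "h\nt".
    "dot": re.findall(r"h.t", "hat hot h\nt"),
    "start": bool(re.search(r"^Java", "Java point")),
    "end": bool(re.search(r"point$", "Java point")),
    # * and + apply only to the item right before them, here the letter o.
    "star": re.findall(r"hello*", "hell hello hellooo"),
    "plus": re.findall(r"hello+", "hell hello hellooo"),
    # {2} also applies only to the final a, so this needs "jav" plus "aa".
    "braces": re.findall(r"java{2}", "java javaa"),
    "either": re.findall(r"java|point", "java point"),
    # Parentheses make + repeat the whole group "ha".
    "group": re.search(r"(ha)+", "hahaha").group(),
}

for name, result in checks.items():
    print(name, "->", result)
```

Note that a quantifier such as *, + or {2} repeats only the single item to its left unless parentheses group several characters together.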
Special Sequences
A special sequence is a backslash (\) followed by one of the characters listed below.
Character  Description
\A         Matches only at the start of the string.
\b         Matches at the beginning or end of a word (a word boundary).
\B         Matches at any position that is not the beginning or end of a word.
\d         Matches any digit.
\D         Matches any character that is not a digit.
\s         Matches any whitespace character.
\S         Matches any character that is not a whitespace character.
\w         Matches any word character (a letter, a digit, or an underscore).
\W         Matches any character that is not a word character.
\Z         Matches only at the end of the string.
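A small experiment can make the special sequences more concrete. The sample strings and variable names below are assumptions made up for this sketch.

```python
import re

text = "Room 42 is open"  # hypothetical sample string

digits = re.findall(r"\d", text)       # each digit character
non_digits = re.findall(r"\D+", text)  # runs of non-digit characters
spaces = re.findall(r"\s", text)       # each whitespace character
words = re.findall(r"\w+", text)       # runs of word characters
non_words = re.findall(r"\W", text)    # each non-word character

# \b and \B match positions rather than characters,
# so compare where "is" is found in each case.
phrase = "this is it"
whole_word = [m.start() for m in re.finditer(r"\bis\b", phrase)]
inside_word = [m.start() for m in re.finditer(r"\Bis", phrase)]

# \A and \Z tie the pattern to the very start and very end of the string.
starts_with_room = bool(re.search(r"\ARoom", text))
ends_with_open = bool(re.search(r"open\Z", text))

print(digits, non_digits, spaces, words, non_words)
print(whole_word, inside_word)
print(starts_with_room, ends_with_open)
```

With \b the only hit is the standalone word is, while \B finds the is that sits inside this.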
Sets
A set is a group of characters given inside a pair of square brackets. It matches any single character from that group.
SN  Set         Description
1   [arn]       Matches any one of the listed characters: a, r, or n.
2   [a-n]       Matches any one lowercase character from a to n.
3   [^arn]      Matches any one character except a, r, and n.
4   [0123]      Matches any one of the listed digits: 0, 1, 2, or 3.
5   [0-9]       Matches any one digit from 0 to 9.
6   [0-5][0-9]  Matches any two-digit number from 00 to 59.
7   [a-zA-Z]    Matches any one letter of the alphabet, lower-case or upper-case.
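The sets in the table can be checked one by one with findall. The sample strings and variable names here are assumptions picked for this sketch.

```python
import re

listed = re.findall(r"[arn]", "brand")          # only a, r, or n
in_range = re.findall(r"[a-n]", "python")       # lowercase a through n
negated = re.findall(r"[^arn]", "brand")        # anything but a, r, n
some_digits = re.findall(r"[0123]", "2024-05")  # only 0, 1, 2, or 3
any_digit = re.findall(r"[0-9]", "a1b22")       # any single digit
minutes = re.findall(r"[0-5][0-9]", "07:45:99") # two digits, 00 to 59
letters = re.findall(r"[a-zA-Z]", "R2-D2")      # any letter, either case

for result in (listed, in_range, negated, some_digits,
               any_digit, minutes, letters):
    print(result)
```

Each set matches exactly one character, which is why [0-5][0-9] needs two sets to cover a two-digit number.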
Regular Expressions Methods in Python
1. Let us suppose we want to check whether a string contains a particular match. Here we search for ape in the string.

    import re

    # Search for ape in the string
    if re.search("ape", "The ape was at the apex"):
        print("There is an ape")

Output:

    There is an ape

re.search returns a match object when the pattern is found and None when it is not. A match object counts as true in an if statement, so the message gets printed.
2. Next is findall. This function returns a list of matches.

    import re

    # findall() returns a list of matches
    # . matches any single character, including a space
    allApes = re.findall("ape.", "The ape was at the apex")
    for i in allApes:
        print(i)

Output:

    ape 
    apex

The dot is a wildcard character that stands for any single character, including a space (but not a newline). That is why the first match is ape followed by a space.
3. Next is finditer, which returns an iterator of match objects. You can use span to get the location of each match.

    import re

    theStr = "The ape was at the apex"
    for i in re.finditer("ape.", theStr):
        # span() returns a (start, end) tuple
        locTuple = i.span()
        print(locTuple)
        # Slice the match out using the tuple values
        print(theStr[locTuple[0]:locTuple[1]])

Output:

    (4, 8)
    ape 
    (19, 23)
    apex
4. Square brackets match any one of the characters between the brackets. Matching is case-sensitive, so upper- and lowercase variants are not included unless they are listed.

    import re

    animalStr = "Cat rat mat fat pat"
    allAnimals = re.findall("[crmfp]at", animalStr)
    for i in allAnimals:
        print(i)
    print()

Output:

    rat
    mat
    fat
    pat
5. We can also allow for characters in a range.
    import re

    animalStr = "Cat rat mat fat pat"
    someAnimals = re.findall("[c-mC-M]at", animalStr)
    for i in someAnimals:
        print(i)
    print()

Output:

    Cat
    mat
    fat
6. Use ^ at the start of the brackets to match any character except the ones between the brackets.

    import re

    animalStr = "Cat rat mat fat pat"
    someAnimals = re.findall("[^Cr]at", animalStr)
    for i in someAnimals:
        print(i)
    print()

Output:

    mat
    fat
    pat
7. Replace matching items in a string.

    import re

    owlFood = "rat cat mat pat"

    # You can compile a regex into a pattern object, which provides additional methods
    regex = re.compile("[cr]at")

    # sub() replaces items in the string that match the regex
    # with the first argument passed to sub
    owlFood = regex.sub("owl", owlFood)
    print(owlFood)

Output:

    owl owl mat pat
8. Regex uses the backslash to designate special characters, and Python does the same inside strings, which causes issues. Let's try to get \stuff out of a string.

    import re

    randStr = "Here is \\stuff"

    # This won't find it
    print("Find \\stuff:", re.search("\\stuff", randStr))

    # This does, but we have to put in 4 backslashes, which is messy
    print("Find \\stuff:", re.search("\\\\stuff", randStr))

    # You can get around this by using raw strings,
    # which don't treat backslashes as special
    print("Find \\stuff:", re.search(r"\\stuff", randStr))

Output (Python 3.7 and later; older versions show _sre.SRE_Match instead of re.Match):

    Find \stuff: None
    Find \stuff: <re.Match object; span=(8, 14), match='\\stuff'>
    Find \stuff: <re.Match object; span=(8, 14), match='\\stuff'>
9. We saw that . matches any character, but what if we want to match a period? Put a backslash in front of the period. You do the same with [, ] and other metacharacters.

    import re

    randStr = " F.B.I. I.R.S. CIA"
    print("Matches :", len(re.findall(r".\..\..", randStr)))
    print("Matches :", re.findall(r".\..\..", randStr))

Output:

    Matches : 2
    Matches : ['F.B.I', 'I.R.S']
10. We can match many whitespace characters.

    import re

    randStr = """This is a long
    string that goes
    on for many lines"""
    print(randStr)

    # Remove newlines
    regex = re.compile("\n")
    randStr = regex.sub(" ", randStr)
    print(randStr)

    # You can also match
    # [\b] : Backspace (outside brackets, \b is a word boundary)
    # \f : Form feed
    # \r : Carriage return
    # \t : Tab
    # \v : Vertical tab
    # You may need to remove \r\n on Windows

Output:

    This is a long
    string that goes
    on for many lines
    This is a long string that goes on for many lines
11. We can match digits with \d and non-digits with \D.

    import re

    # \d can be used instead of [0-9]
    # \D is the same as [^0-9]
    randStr = "12345"
    print("Matches :", len(re.findall(r"\d", randStr)))

Output:

    Matches : 5
12. You can match multiple digits by following the \d with {numOfValues}
    import re

    # Match 5 digits in a row
    if re.search(r"\d{5}", "12345"):
        print("It is a zip code")

    # You can also match within a range.
    # Match values that are between 5 and 7 digits long.
    numStr = "123 12345 123456 1234567"
    print("Matches :", len(re.findall(r"\d{5,7}", numStr)))

Output:

    It is a zip code
    Matches : 3
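One thing to keep in mind is that search looks anywhere in the string, so \d{5} also succeeds on a longer run of digits. Below is a minimal sketch of two ways to be stricter, assuming we want exactly five digits: re.fullmatch for a whole string, and \b for numbers inside longer text. The variable names are made up for this illustration.

```python
import re

# search() finds five digits inside a longer run, so this is a match.
loose = bool(re.search(r"\d{5}", "1234567"))

# fullmatch() requires the entire string to fit the pattern.
too_long = bool(re.fullmatch(r"\d{5}", "1234567"))
exact = bool(re.fullmatch(r"\d{5}", "12345"))

# \b keeps a match from starting or ending inside a longer number.
num_str = "123 12345 123456 1234567"
five_only = re.findall(r"\b\d{5}\b", num_str)

print(loose, too_long, exact)
print(five_only)
```

Which approach fits depends on whether the input is a single field to validate or free text to search through.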