Regex in Python (part 1)
Search for string in text⌗
Assume we have the text "the quick brown fox jumped over the lazy dog" and want to search for a substring in it, e.g. "quick".
import re
text = "the quick brown fox jumped over the lazy dog"
match = re.search("quick", text)
As the re.search() documentation says, this function scans the string for the first location where the pattern matches and returns a re.Match object if one is found; otherwise it returns None.
If we print(match), we’ll see <re.Match object; span=(4, 9), match='quick'>, which indicates that the matched string starts at index 4 and ends at index 9 (exclusive).
To get the matched value that the re.Match object holds, we can simply call:
match.group()
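Because re.search() can return None, it is usually safer to check the result before calling methods on it. The following is a minimal sketch of that pattern; the variable names matched_value, same_as_slice and missing, and the extra search word "cat", are only illustrative and not part of the original example.

```python
import re

text = "the quick brown fox jumped over the lazy dog"

match = re.search("quick", text)
if match is not None:
    # group() with no arguments returns the whole matched string
    matched_value = match.group()
    # span() returns (start, end); end is exclusive, like a slice
    start, end = match.span()
    # slicing the original text with the span gives the same substring
    same_as_slice = text[start:end]

# "cat" (an illustrative word) does not occur in the text,
# so re.search() returns None instead of a re.Match object
missing = re.search("cat", text)
```

Here matched_value and same_as_slice both hold the string quick, and missing is None.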
Find characters by type⌗
Assume we’re now working with a slightly different text from the example above:
import re
text = "the quick brown fox jumped over the lazy dog 1234567890 !@#$%^&*()_"
Find alphanumeric characters⌗
To find all the word characters, we can use the pattern \w.
characters = re.findall(r"\w", text)  # raw string keeps the backslash literal
When we print characters, we get every word character in the text as a list of single-character strings. Whitespace and the symbols !@#$%^&*() are not returned because they are not word characters; the underscore _ is the exception, since \w also matches it.
['t','h','e','q','u','i','c','k','b','r','o','w','n','f','o','x','j','u','m','p','e','d','o','v','e','r','t','h','e','l','a','z','y','d','o','g','1','2','3','4','5','6','7','8','9','0','_']
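For str patterns in Python 3, \w is Unicode-aware by default, so "word character" covers more than a-z, 0-9 and _. The sketch below uses a made-up sample string (not from the example above) to show the difference when the re.ASCII flag is passed.

```python
import re

# illustrative sample containing one accented letter (é written as an escape)
sample = "caf\u00e9_1!"

# default (Unicode) matching: the accented letter counts as a word character
unicode_word_chars = re.findall(r"\w", sample)

# with re.ASCII, \w is limited to [a-zA-Z0-9_]
ascii_word_chars = re.findall(r"\w", sample, flags=re.ASCII)
```

In both cases the underscore and the digit are matched and the exclamation mark is not; only the treatment of the accented letter changes.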
Find any characters⌗
To find any character, whether it is a word character or not, use . (dot). By default, . matches any character except a newline.
any_characters = re.findall(".", text)
Note that the result now also contains the whitespace characters ' ' and the symbols:
['t','h','e',' ','q','u','i','c','k',' ','b','r','o','w','n',' ','f','o','x',' ','j','u','m','p','e','d',' ','o','v','e','r',' ','t','h','e',' ','l','a','z','y',' ','d','o','g',' ','1','2','3','4','5','6','7','8','9','0',' ','!','@','#','$','%','^','&','*','(',')','_']
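The text in this example is a single line, so . matches every character in it. With multi-line text the behaviour differs, because . skips newlines unless the re.DOTALL flag is set. A small sketch, using an illustrative two-line string that is not part of the original example:

```python
import re

# illustrative two-line string
two_lines = "fox\ndog"

# by default, "." does not match the newline character
without_flag = re.findall(".", two_lines)

# with re.DOTALL, "." matches the newline as well
with_flag = re.findall(".", two_lines, flags=re.DOTALL)
```

without_flag contains only the six letters, while with_flag also contains '\n' between 'x' and 'd'.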
Find non-word characters⌗
The opposite of \w is \W (uppercase), which we can use to find all non-word characters:
non_word_characters = re.findall(r"\W", text)
The result now contains only whitespace and symbol characters (the underscore is missing because it counts as a word character):
[' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', '!', '@', '#', '$', '%', '^', '&', '*', '(', ')']
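Since \w and \W are complements, every character in the text is matched by exactly one of them. The following sketch checks this on the example text; the name total is only illustrative.

```python
import re

text = "the quick brown fox jumped over the lazy dog 1234567890 !@#$%^&*()_"

word_characters = re.findall(r"\w", text)
non_word_characters = re.findall(r"\W", text)

# every character falls into exactly one of the two lists,
# so their lengths add up to the length of the text
total = len(word_characters) + len(non_word_characters)
```

Here total equals len(text), and the underscore appears in word_characters rather than in non_word_characters.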