Search for string in text

Suppose we have the text "the quick brown fox jumped over the lazy dog" and want to search it for a substring, e.g. quick.

import re

text = "the quick brown fox jumped over the lazy dog"

match = re.search("quick", text)

As the re.search() documentation says, this function scans the string for the first location where the pattern matches and returns a re.Match object for it; if there is no match, it returns None.

If we print(match), we’ll see <re.Match object; span=(4, 9), match='quick'>, which indicates that the matched string starts at index 4 and ends just before index 9 (the end index is exclusive).

To get the matched value that the re.Match object is holding, we can simply use

match.group()
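Because re.search() returns None when nothing matches, calling .group() directly on the result can raise an AttributeError. The following is a minimal sketch of a safer pattern; the names found, start, end and missing, and the search word "cat", are chosen here for illustration only.

```python
import re

text = "the quick brown fox jumped over the lazy dog"

match = re.search("quick", text)
if match is not None:
    found = match.group()       # the matched text
    start, end = match.span()   # start index and exclusive end index

# A pattern that does not occur in the text yields None, not a Match object
missing = re.search("cat", text)
```

Slicing the original string with text[start:end] gives back the same text as match.group().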

Find characters by type

Now suppose we’re working with a slightly different text from the example above:

import re

text = "the quick brown fox jumped over the lazy dog 1234567890 !@#$%^&*()_"

Find alphanumeric characters

To find all the word characters, we can use the regular expression \w.

characters = re.findall(r"\w", text)

Printing characters shows every word character in the text as a list of single-character strings. The spaces and the symbols !@#$%^&*() are not returned because they are not word characters; the underscore _ is a word character, so it is included.

['t','h','e','q','u','i','c','k','b','r','o','w','n','f','o','x','j','u','m','p','e','d','o','v','e','r','t','h','e','l','a','z','y','d','o','g','1','2','3','4','5','6','7','8','9','0','_']
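What counts as a word character depends on the pattern type. For str patterns in Python 3, \w also matches Unicode letters and digits, and the re.ASCII flag narrows it to [a-zA-Z0-9_]. A small sketch, using a sample string that is not part of the text above:

```python
import re

# Hypothetical sample containing a non-ASCII letter (é written as \u00e9)
sample = "caf\u00e9_1!"

# Default for str patterns: Unicode-aware, so the accented letter matches
unicode_word_chars = re.findall(r"\w", sample)

# With re.ASCII, only a-z, A-Z, 0-9 and the underscore match
ascii_word_chars = re.findall(r"\w", sample, flags=re.ASCII)
```

In both cases the trailing ! is skipped and the underscore is kept.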

Find any characters

To find any character, whether or not it is a word character, use . (by default it matches every character except a newline).

any_characters = re.findall(".", text)

Note that the result now also contains the spaces ' ' and the symbols:

['t','h','e',' ','q','u','i','c','k',' ','b','r','o','w','n',' ','f','o','x',' ','j','u','m','p','e','d',' ','o','v','e','r',' ','t','h','e',' ','l','a','z','y',' ','d','o','g',' ','1','2','3','4','5','6','7','8','9','0',' ','!','@','#','$','%','^','&','*','(',')','_']
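The example text has no line breaks, so . appears to match everything. By default, though, . skips newline characters unless the re.DOTALL flag is passed. A short sketch with a made-up two-line string:

```python
import re

# Hypothetical sample with a newline in the middle
two_lines = "a\nb"

# Default behaviour: the newline is not matched
without_newline = re.findall(".", two_lines)

# With re.DOTALL, "." matches the newline as well
with_newline = re.findall(".", two_lines, flags=re.DOTALL)
```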

Find non-word characters

The opposite of \w is \W (uppercase), which we can use to find all non-word characters:

non_word_characters = re.findall(r"\W", text)

The result now contains only whitespace and symbol characters. The underscore is absent because it counts as a word character.

[' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', '!', '@', '#', '$', '%', '^', '&', '*', '(', ')']
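Since \w and \W are complements, every character in the text is matched by exactly one of them. A quick sanity check, sketched with the same text as above:

```python
import re

text = "the quick brown fox jumped over the lazy dog 1234567890 !@#$%^&*()_"

word_characters = re.findall(r"\w", text)
non_word_characters = re.findall(r"\W", text)

# Together the two lists account for every character in the text
covers_everything = len(word_characters) + len(non_word_characters) == len(text)

# The underscore lands on the word-character side
underscore_is_word = "_" in word_characters and "_" not in non_word_characters
```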