Regular Expressions in Python

If you want to validate the format of a string or find specific substrings within it, regular expressions are very powerful. Python's re module provides a useful set of tools for this, although that power can sometimes be difficult to understand. Here, we look at a few key things that might not be immediately obvious when using regex.

Regex Findall

re.findall, as the name implies, is meant to find all matching substrings within your string. The problem I had, though, was that it did not seem to find them all. Consider the following, where we have two equivalent versions of a regex that matches a pair of consecutive digits.

import re

myStrings = ["523563", "100001", "110000", "552523", "542361"]

# Raw strings keep the backslashes intact for the regex engine
reg_string_list = [r"\d\d", r"[0-9][0-9]"]

print("Learn Findall by Examples")

for ms in myStrings:
  print(ms)
  for reg in reg_string_list:
    # findall returns the matches it finds, scanning left to right
    result = re.findall(reg, ms)
    print("  " + reg + " : " + str(result))

When we run this, we can see that it only finds non-overlapping pairs. This is because each match consumes the characters it matched, and the search resumes from the end of that match.

python3 regex.py
Learn Findall by Examples
523563
  \d\d : ['52', '35', '63']
  [0-9][0-9] : ['52', '35', '63']
100001
  \d\d : ['10', '00', '01']
  [0-9][0-9] : ['10', '00', '01']
110000
  \d\d : ['11', '00', '00']
  [0-9][0-9] : ['11', '00', '00']
552523
  \d\d : ['55', '25', '23']
  [0-9][0-9] : ['55', '25', '23']
542361
  \d\d : ['54', '23', '61']
  [0-9][0-9] : ['54', '23', '61'] 
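
One way to see the consumption directly is to print where each match starts and ends. The following is a minimal sketch using re.finditer, which yields match objects rather than plain strings; the variable names are only for illustration.

```python
import re

text = "523563"

# finditer yields match objects, so we can inspect each match's position
spans = [(m.group(), m.span()) for m in re.finditer(r"\d\d", text)]

for pair, (start, end) in spans:
    print(pair, start, end)
```

Each match begins where the previous one ended (0, 2, 4), so the pairs starting at index 1 and 3 ('23' and '56') are never tried.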

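To find overlapping pairs, we can use a lookahead assertion, (?=...), which checks what follows the current position without consuming any of the string. Because the lookahead itself matches nothing, we put a capturing group inside it; findall then returns the contents of that group at every position where the lookahead succeeds, giving us all pairs.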

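import re

myStrings = ["523563", "100001", "110000", "552523", "542361"]

# The third pattern wraps a capturing group in a lookahead, so nothing is consumed
reg_string_list = [r"\d\d", r"[0-9][0-9]", r"(?=(\d\d))"]

print("Learn Findall by Examples")

for ms in myStrings:
  print(ms)
  for reg in reg_string_list:
    # With one capturing group, findall returns the group's text for each match
    result = re.findall(reg, ms)
    print("  " + reg + " : " + str(result))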

This results in:

python3 regex.py
Learn Findall by Examples
523563
  \d\d : ['52', '35', '63'] 
  [0-9][0-9] : ['52', '35', '63'] 
  (?=(\d\d)) : ['52', '23', '35', '56', '63'] 
100001
  \d\d : ['10', '00', '01'] 
  [0-9][0-9] : ['10', '00', '01'] 
  (?=(\d\d)) : ['10', '00', '00', '00', '01'] 
110000
  \d\d : ['11', '00', '00'] 
  [0-9][0-9] : ['11', '00', '00'] 
  (?=(\d\d)) : ['11', '10', '00', '00', '00'] 
552523
  \d\d : ['55', '25', '23'] 
  [0-9][0-9] : ['55', '25', '23'] 
  (?=(\d\d)) : ['55', '52', '25', '52', '23'] 
542361
  \d\d : ['54', '23', '61'] 
  [0-9][0-9] : ['54', '23', '61'] 
  (?=(\d\d)) : ['54', '42', '23', '36', '61']
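
The inner parentheses in (?=(\d\d)) matter. A lookahead on its own matches an empty string, so without a capturing group findall has nothing to return but empty strings. Here is a small sketch of the difference; the helper name overlapping_pairs is made up for illustration and is not part of the re module.

```python
import re

def overlapping_pairs(text):
    # Hypothetical helper: the lookahead checks for two digits without
    # consuming them, and the inner group captures what it saw
    return re.findall(r"(?=(\d\d))", text)

# Without the capturing group, each match is the empty string
no_group = re.findall(r"(?=\d\d)", "523563")
with_group = overlapping_pairs("523563")

print(no_group)    # ['', '', '', '', '']
print(with_group)  # ['52', '23', '35', '56', '63']
```

If the positions of the pairs are also needed, re.finditer with the same pattern gives match objects, where m.start() is the index of the pair and m.group(1) is its text.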