Regular expressions, often abbreviated as RegEx, are powerful tools for pattern matching and text manipulation in Python. They provide a flexible and efficient way to search, extract, and modify strings based on specific patterns. In this comprehensive guide, we'll explore the world of Python RegEx, from basic concepts to advanced techniques, with plenty of practical examples along the way.
Introduction to Regular Expressions in Python
Regular expressions are sequences of characters that define search patterns. Python's re module provides support for regular expressions, allowing developers to perform complex string operations with ease.
To get started with RegEx in Python, you'll need to import the re module:
import re
Basic Pattern Matching
Let's begin with some simple pattern matching examples to understand the fundamentals of RegEx in Python.
Matching Simple Strings
To find a literal string within another string, you can use the re.search() function:
text = "Hello, World!"
pattern = r"World"
match = re.search(pattern, text)
if match:
    print("Pattern found!")
else:
    print("Pattern not found.")
# Output: Pattern found!
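🔍 In this example, we use the r prefix before the pattern string to create a raw string. In a raw string, Python leaves backslashes untouched instead of interpreting them as escape sequences, so they reach the regex engine exactly as written.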
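To see why the raw-string prefix matters, here is a small sketch (the sample text is made up for illustration). Without the r prefix, Python converts \b into a backspace character before the regex engine ever sees it:

```python
import re

text = "cat concat"

# Without the r prefix, "\b" is a backspace character, so nothing matches.
plain = re.findall("\bcat\b", text)

# With the r prefix, the regex engine receives \b and treats it as a word boundary.
raw = re.findall(r"\bcat\b", text)

print(plain)  # []
print(raw)    # ['cat']
```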
Matching at the Beginning or End of a String
To match patterns at the beginning or end of a string, use the ^ and $ anchors:
# Match at the beginning
text = "Python is awesome"
pattern = r"^Python"
match = re.search(pattern, text)
print(match.group() if match else "No match")
# Output: Python
# Match at the end
text = "I love coding in Python"
pattern = r"Python$"
match = re.search(pattern, text)
print(match.group() if match else "No match")
# Output: Python
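The two anchors can also be combined so that the pattern has to account for the whole string. A minimal sketch, reusing the strings from the examples above:

```python
import re

pattern = r"^Python$"  # the string must be exactly "Python"

exact = re.search(pattern, "Python")
longer = re.search(pattern, "Python is awesome")

print(exact.group() if exact else "No match")    # Python
print(longer.group() if longer else "No match")  # No match
```

Keep in mind that $ also matches just before a trailing newline, so "Python\n" would still be accepted by this pattern.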
Character Classes and Quantifiers
Character classes allow you to match specific sets of characters, while quantifiers specify how many times a character or group should appear.
Character Classes
Here are some common character classes:
- \d: Matches any digit (0-9)
- \w: Matches any word character (a-z, A-Z, 0-9, and underscore; for str patterns this also covers Unicode letters and digits unless re.ASCII is set)
- \s: Matches any whitespace character (space, tab, newline)
Let's see them in action:
text = "The year is 2023, and the temperature is 25°C."
# Match digits
digit_pattern = r"\d+"
digits = re.findall(digit_pattern, text)
print("Digits found:", digits)
# Output: Digits found: ['2023', '25']
# Match words
word_pattern = r"\w+"
words = re.findall(word_pattern, text)
print("Words found:", words)
# Output: Words found: ['The', 'year', 'is', '2023', 'and', 'the', 'temperature', 'is', '25', 'C']
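The \s class is not exercised above, so here is a short sketch of it using re.split(); the sample string is an assumption chosen to mix spaces, a tab, and a newline:

```python
import re

text = "one  two\tthree\nfour"

# \s+ matches any run of spaces, tabs, or newlines.
parts = re.split(r"\s+", text)
print(parts)  # ['one', 'two', 'three', 'four']
```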
Quantifiers
Quantifiers specify how many times a character or group should appear:
- *: 0 or more occurrences
- +: 1 or more occurrences
- ?: 0 or 1 occurrence
- {n}: Exactly n occurrences
- {n,m}: Between n and m occurrences
Here's an example using quantifiers:
text = "The quick brown fox jumps over the lazy dog."
# Match words with 3 to 5 characters
pattern = r"\b\w{3,5}\b"
matches = re.findall(pattern, text)
print("Words with 3 to 5 characters:", matches)
# Output: Words with 3 to 5 characters: ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
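The other quantifiers from the list can be compared side by side. This is only an illustrative sketch; the sample text and variable names are made up:

```python
import re

text = "color colour colouur 2023 12345"

optional_u = re.findall(r"\bcolou?r\b", text)  # ? allows zero or one "u"
any_u = re.findall(r"\bcolou*r\b", text)       # * allows zero or more "u"
four_digits = re.findall(r"\b\d{4}\b", text)   # {4} requires exactly four digits

print(optional_u)   # ['color', 'colour']
print(any_u)        # ['color', 'colour', 'colouur']
print(four_digits)  # ['2023']
```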
Groups and Capturing
Groups allow you to treat multiple characters as a single unit and capture specific parts of a match.
Basic Grouping
Use parentheses ()
to create groups:
text = "John Doe (john@example.com)"  # placeholder address
pattern = r"(\w+) (\w+) \((\w+@\w+\.\w+)\)"
match = re.search(pattern, text)
if match:
    print("Full Name:", match.group(1), match.group(2))
    print("Email:", match.group(3))
# Output:
# Full Name: John Doe
# Email: john@example.com
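Groups also change what re.findall() returns: when a pattern contains more than one group, each result is a tuple of the captured parts rather than the full match. A small sketch with made-up key=value data:

```python
import re

text = "x=1, y=22, z=333"

# Two groups in the pattern, so findall returns a list of 2-tuples.
pairs = re.findall(r"(\w)=(\d+)", text)
print(pairs)  # [('x', '1'), ('y', '22'), ('z', '333')]
```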
Named Groups
You can assign names to groups for easier reference:
text = "John Doe (john@example.com)"  # placeholder address
pattern = r"(?P<first_name>\w+) (?P<last_name>\w+) \((?P<email>\w+@\w+\.\w+)\)"
match = re.search(pattern, text)
if match:
    print("First Name:", match.group("first_name"))
    print("Last Name:", match.group("last_name"))
    print("Email:", match.group("email"))
# Output:
# First Name: John
# Last Name: Doe
# Email: john@example.com
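When a pattern uses named groups, match.groupdict() collects all of them into a dictionary in one call. The name and address below are hypothetical sample data:

```python
import re

text = "Jane Roe (jane@example.com)"  # hypothetical sample data
pattern = r"(?P<first_name>\w+) (?P<last_name>\w+) \((?P<email>\w+@\w+\.\w+)\)"

match = re.search(pattern, text)
fields = match.groupdict() if match else {}
print(fields)
# {'first_name': 'Jane', 'last_name': 'Roe', 'email': 'jane@example.com'}
```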
Lookahead and Lookbehind Assertions
Lookahead and lookbehind assertions allow you to match patterns based on what comes before or after them, without including those parts in the match.
Positive Lookahead
Match a pattern only if it's followed by another pattern:
text = "I have 100 dollars and 200 euros."
pattern = r"\d+(?= dollars)"
matches = re.findall(pattern, text)
print("Amount in dollars:", matches)
# Output: Amount in dollars: ['100']
Negative Lookahead
Match a pattern only if it's not followed by another pattern:
text = "File1.txt File2.jpg File3.pdf"
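# Check the extension right after the dot. A bare \w+(?!\.jpg) would
# simply backtrack and still match part of the name, such as 'File'.
pattern = r"\b\w+\.(?!jpg\b)\w+"
matches = re.findall(pattern, text)
print("Files not ending with .jpg:", matches)
# Output: Files not ending with .jpg: ['File1.txt', 'File3.pdf']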
Positive Lookbehind
Match a pattern only if it's preceded by another pattern:
text = "The price is $100 and €50."
pattern = r"(?<=\$)\d+"
matches = re.findall(pattern, text)
print("Amount in dollars:", matches)
# Output: Amount in dollars: ['100']
Negative Lookbehind
Match a pattern only if it's not preceded by another pattern:
text = "123abc 456def 789ghi"
pattern = r"(?<!\d)abc"
matches = re.findall(pattern, text)
print("'abc' not preceded by a digit:", matches)
# Output: 'abc' not preceded by a digit: []
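For a negative lookbehind that does produce matches, consider picking out numbers that are not prefixed by a currency symbol. This is a sketch with invented sample text; note that the lookbehind also has to exclude digits, otherwise the engine would match the tail of a number it was supposed to skip:

```python
import re

text = "Price $100, quantity 200, deposit €300"

# Skip numbers that directly follow a currency symbol. The \d inside the
# lookbehind prevents a match on the trailing digits of a skipped number
# (for example the "00" in "$100").
pattern = r"(?<![$€\d])\d+"
plain_numbers = re.findall(pattern, text)
print(plain_numbers)  # ['200']
```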
Greedy vs. Non-Greedy Matching
By default, quantifiers in regular expressions are greedy, meaning they match as much as possible. You can make them non-greedy (lazy) by adding a ? after the quantifier.
text = "<p>This is a paragraph.</p><p>This is another paragraph.</p>"
# Greedy matching
greedy_pattern = r"<p>.*</p>"
greedy_match = re.search(greedy_pattern, text)
print("Greedy match:", greedy_match.group())
# Output: Greedy match: <p>This is a paragraph.</p><p>This is another paragraph.</p>
# Non-greedy matching
non_greedy_pattern = r"<p>.*?</p>"
non_greedy_matches = re.findall(non_greedy_pattern, text)
print("Non-greedy matches:", non_greedy_matches)
# Output: Non-greedy matches: ['<p>This is a paragraph.</p>', '<p>This is another paragraph.</p>']
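A lazy quantifier pairs naturally with a capturing group when only the text between the tags is wanted. The sketch below also shows a negated character class as an alternative that needs no lazy matching; both variants are illustrations, not a general HTML parser:

```python
import re

text = "<p>This is a paragraph.</p><p>This is another paragraph.</p>"

# Lazy quantifier inside a group: capture as little as possible up to </p>.
inner_lazy = re.findall(r"<p>(.*?)</p>", text)

# Alternative: forbid "<" inside the captured text instead of matching lazily.
inner_negated = re.findall(r"<p>([^<]*)</p>", text)

print(inner_lazy)     # ['This is a paragraph.', 'This is another paragraph.']
print(inner_negated)  # ['This is a paragraph.', 'This is another paragraph.']
```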
Flags and Modifiers
Python's re module provides several flags to modify the behavior of regular expressions:
- re.IGNORECASE or re.I: Case-insensitive matching
- re.MULTILINE or re.M: ^ and $ match at the beginning and end of each line
- re.DOTALL or re.S: . matches newline characters
- re.VERBOSE or re.X: Allows you to write more readable regular expressions
Here's an example using flags:
text = """
First line
SECOND LINE
Third line
"""
pattern = r"^second line$"
# Without flags
match = re.search(pattern, text)
print("Without flags:", match)
# Output: Without flags: None
# With flags
match = re.search(pattern, text, re.IGNORECASE | re.MULTILINE)
print("With flags:", match.group() if match else "No match")
# Output: With flags: SECOND LINE
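The remaining two flags can be sketched in the same way. The date string and variable names here are assumptions chosen for illustration:

```python
import re

# re.VERBOSE ignores whitespace in the pattern and allows inline comments.
date_pattern = re.compile(r"""
    (\d{4})   # year
    -
    (\d{2})   # month
    -
    (\d{2})   # day
""", re.VERBOSE)

date_match = date_pattern.search("Released on 2023-03-15.")
print(date_match.groups())  # ('2023', '03', '15')

# re.DOTALL lets . match newline characters as well.
text = "start\nend"
without_dotall = re.search(r"start.*end", text)
with_dotall = re.search(r"start.*end", text, re.DOTALL)
print(without_dotall)              # None
print(with_dotall is not None)     # True
```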
Advanced Techniques
Let's explore some advanced RegEx techniques that can be incredibly useful in complex scenarios.
Backreferences
Backreferences allow you to refer to previously captured groups within the same regular expression:
text = "The cat sat on the the mat. The rat ate the cat's food."
pattern = r"\b(\w+)\s+\1\b"
matches = re.findall(pattern, text)
print("Repeated words:", matches)
# Output: Repeated words: ['the']
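Backreferences are also available in the replacement string of re.sub(), which makes it possible to collapse a doubled word into a single one. A minimal sketch with a made-up sentence:

```python
import re

text = "This is is a test test."

# \1 in the pattern requires the same word again; \1 in the replacement
# keeps a single copy of it.
cleaned = re.sub(r"\b(\w+)\s+\1\b", r"\1", text)
print(cleaned)  # This is a test.
```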
Conditional Matching
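The re module supports conditional patterns of the form (?(id)yes|no), which choose a branch depending on whether an earlier group matched. A closely related technique is to reuse an earlier choice through a backreference. The function below accepts a date only when both separators are the same:
def match_date(text):
    pattern = r"(\d{2})(-|/)(\d{2})\2(\d{4})"
    match = re.search(pattern, text)
    if match:
        return f"Date found: {match.group()}"
    return "No date found"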
print(match_date("Today is 15-03-2023"))
print(match_date("Today is 15/03/2023"))
print(match_date("Today is 15.03.2023"))
# Output:
# Date found: 15-03-2023
# Date found: 15/03/2023
# No date found
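For the conditional construct itself, (?(id)yes|no) tries the yes branch when group id took part in the match and the no branch otherwise. The sketch below adapts the bracketed-address pattern from the Python documentation; the sample strings are illustrative:

```python
import re

# (<)? optionally captures an opening bracket. (?(1)>|$) then requires a
# closing ">" if group 1 matched, and the end of the string otherwise.
pattern = re.compile(r"(<)?(\w+@\w+(?:\.\w+)+)(?(1)>|$)")

samples = ["<user@host.com>", "user@host.com", "<user@host.com", "user@host.com>"]
results = [bool(pattern.match(s)) for s in samples]
print(results)  # [True, True, False, False]
```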
Recursive Patterns
The third-party regex module (an alternative to re, installed separately from PyPI) supports recursive patterns, which can be useful for matching nested structures. Here (?R) re-applies the whole pattern at that point:
import regex
text = "((a+b)*c)"
pattern = r"\((?:[^()]++|(?R))*\)"
match = regex.search(pattern, text)
print("Matched nested parentheses:", match.group())
# Output: Matched nested parentheses: ((a+b)*c)
Performance Considerations
When working with regular expressions, keep these performance tips in mind:
- 🚀 Compile patterns you use frequently:
compiled_pattern = re.compile(r"\d+")
numbers = compiled_pattern.findall("123 456 789")
- ⚡ Use non-capturing groups (?:...) when you don't need to extract the group:
pattern = r"(?:https?://)?(?:www\.)?example\.com"
- 🎯 Be specific with your patterns to avoid unnecessary backtracking:
# Less efficient
pattern = r".*@.*\.com"
# More efficient
pattern = r"[^@]+@[^@]+\.com"
- 🧠 Use appropriate quantifiers and avoid overuse of .*:
# Less efficient
pattern = r"<.*>"
# More efficient
pattern = r"<[^>]*>"
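Two of these tips can be combined in one short sketch: compile a specific pattern once, then reuse it across many inputs. The name TAG and the sample lines are assumptions made for illustration:

```python
import re

# Compile once, outside the loop, then reuse the pattern object.
TAG = re.compile(r"<[^>]*>")

lines = ["<a><b>text</b></a>", "no tags here", "<br>"]
found = [TAG.findall(line) for line in lines]

for tags in found:
    print(tags)
# ['<a>', '<b>', '</b>', '</a>']
# []
# ['<br>']
```

The negated character class changes the result as well as the cost: on the first line, the greedy <.*> would return the whole string as a single match.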
Conclusion
Regular expressions in Python are a powerful tool for pattern matching and text manipulation. From basic string matching to complex parsing tasks, RegEx can significantly simplify your code and improve its efficiency. As you've seen in this comprehensive guide, Python's re module provides a wide range of features to handle various text processing scenarios.
Remember that while regular expressions are incredibly useful, they can also become complex and hard to read. Always strive for clarity and maintainability in your RegEx patterns, and don't hesitate to break down complex patterns into smaller, more manageable pieces.
With practice and experience, you'll become more proficient in crafting efficient and effective regular expressions, enhancing your Python programming skills and solving complex text processing challenges with ease.