Introduction
Do you often need to process text in a wide variety of formats? Are you tired of writing tedious string-handling logic? Today I want to share a powerful text processing tool: regular expressions. For a Python programmer, mastering them is practically essential. Let's begin this learning journey together.
Understanding
I remember how I felt when I first started learning regular expressions: those strings of special characters looked like secret codes, and they were genuinely intimidating. But the more I studied and practiced, the more I came to see regular expressions as an elegant and practical tool.
Regular expressions are essentially a pattern-matching language for strings. You can think of them as an "intelligent text searcher". For example, to find every phone number in an article you might need a pile of conditional logic with ordinary string methods, but with a regular expression you can do it in a single line of code.
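To make that concrete, here is a minimal sketch. The sample text and the simplified Chinese mobile number pattern are just assumptions for illustration; the pattern itself is explained later in this article.

import re

article = "Contact Xiao Ming at 13912345678 or Xiao Zhang at 15987654321."
# A single findall call replaces all the manual scanning logic
print(re.findall(r'1[3-9]\d{9}', article))  # ['13912345678', '15987654321']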
Foundation
Before learning specific syntax, let's understand the most basic concept in regular expressions - metacharacters. These special characters are like the "building blocks" of regular expressions. By combining them, we can construct various complex matching patterns.
Let's look at some of the most commonly used metacharacters:
- . : matches any single character (except newline)
- ^ : matches the start of the string
- $ : matches the end of the string
- * : matches the previous pattern zero or more times
- + : matches the previous pattern one or more times
- ? : matches the previous pattern zero or one time
You might ask: these symbols look abstract, so how are we supposed to remember them all? My suggestion is not to memorize them by rote, but to understand and apply them through practical cases.
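Here is a small, self-contained demo of the six metacharacters above. The sample strings are made up purely for illustration.

import re

# . matches any single character except a newline
print(re.findall(r'c.t', 'cat cot ct'))      # ['cat', 'cot']

# ^ and $ anchor the pattern to the start and end of the string
print(bool(re.match(r'^hello$', 'hello')))   # True
print(bool(re.match(r'^hello$', 'hello!')))  # False

# * (zero or more), + (one or more) and ? (zero or one) repeat the previous pattern
print(re.findall(r'ab*', 'a ab abb'))        # ['a', 'ab', 'abb']
print(re.findall(r'ab+', 'a ab abb'))        # ['ab', 'abb']
print(re.findall(r'ab?', 'a ab abb'))        # ['a', 'ab', 'ab']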
Practice
Let's see how regular expressions work through some practical examples.
First, let's look at a simple example: matching email addresses.
import re
text = "My email is [email protected], work email is [email protected]"
pattern = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'
emails = re.findall(pattern, text)
print(f"Found email addresses: {emails}")
Want to know how this regular expression works? Let's break it down:
- \b : marks a word boundary
- [A-Za-z0-9._%+-]+ : matches the name (local) part of the address
- @ : matches the literal @ symbol
- [A-Za-z0-9.-]+ : matches the domain name part
- \. : matches the dot before the top-level domain
- [A-Za-z]{2,} : matches the top-level domain (two or more letters)
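If a pattern like this is hard to read at a glance, the re.VERBOSE flag lets you spread it out and comment each piece. Here is a sketch of the same pattern written that way; the sample address is an assumption for illustration.

import re

# The same email pattern, laid out with re.VERBOSE so every part can be commented
email_pattern = re.compile(r"""
    \b
    [A-Za-z0-9._%+-]+   # name (local) part
    @                   # literal @ symbol
    [A-Za-z0-9.-]+      # domain name
    \.                  # dot before the top-level domain
    [A-Za-z]{2,}        # top-level domain
    \b
""", re.VERBOSE)

print(email_pattern.findall("My email is xiaoming@example.com"))  # ['xiaoming@example.com']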
Let's look at another practical example: extracting Chinese mobile phone numbers.
import re
text = """
Xiao Ming's phone is 13912345678
Xiao Hong's number is +86 139-1234-5678
Xiao Zhang's phone is 15987654321, backup number is 13812345678
"""
pattern = r'1[3-9]\d{9}'
phone_numbers = re.findall(pattern, text)
print(f"Found phone numbers: {phone_numbers}")
This regular expression means:
- 1 : matches the first digit, 1
- [3-9] : matches the second digit, which must be 3 through 9
- \d{9} : matches the following 9 digits
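You may have noticed that the formatted number +86 139-1234-5678 in the sample text is not matched. Two common workarounds are sketched below; both the normalization step and the separator-tolerant pattern are my own assumptions, not the only way to do it.

import re

text = "Xiao Hong's number is +86 139-1234-5678"

# Option 1: strip spaces and dashes first, then apply the simple pattern
normalized = re.sub(r'[-\s]', '', text)
print(re.findall(r'1[3-9]\d{9}', normalized))               # ['13912345678']

# Option 2: allow an optional separator between the digit groups
print(re.findall(r'1[3-9]\d[-\s]?\d{4}[-\s]?\d{4}', text))  # ['139-1234-5678']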
Advanced
After mastering the basics, let's look at some more advanced applications.
Group Matching
Sometimes we not only need to match text but also extract specific parts. This is where grouping comes in:
import re
log = "2024-01-15 10:30:45 [ERROR] Failed to connect to database"
pattern = r'(\d{4}-\d{2}-\d{2}) (\d{2}:\d{2}:\d{2}) \[(\w+)\] (.*)'
match = re.match(pattern, log)
if match:
    date, time, level, message = match.groups()
    print(f"Date: {date}")
    print(f"Time: {time}")
    print(f"Level: {level}")
    print(f"Message: {message}")
Greedy vs Non-Greedy Matching
This is a very important concept in regular expressions. Look at this example:
import re
text = "<div>First part</div><div>Second part</div>"
pattern1 = r'<div>.*</div>'
print("Greedy matching result:", re.findall(pattern1, text))
pattern2 = r'<div>.*?</div>'
print("Non-greedy matching result:", re.findall(pattern2, text))
You'll find that greedy matching consumes as many characters as possible, while non-greedy matching stops as soon as the rest of the pattern can match. Non-greedy matching is usually what you want when dealing with markup languages like HTML.
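Combined with a capture group, the non-greedy version pulls out just the contents of each tag. A small sketch on the same sample string:

import re

text = "<div>First part</div><div>Second part</div>"

# Capture only what sits between the tags, non-greedily
print(re.findall(r'<div>(.*?)</div>', text))  # ['First part', 'Second part']

For anything more complicated than small snippets like this, a real HTML parser is usually a safer choice than regular expressions.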
Optimization
Here are some performance optimization tips when using regular expressions:
- Use re.compile() to pre-compile regular expressions:
import re
import time
text = "[email protected] " * 10000
start_time = time.time()
pattern = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'
for _ in range(100):
    re.findall(pattern, text)
print(f"Time without pre-compilation: {time.time() - start_time:.4f} seconds")
start_time = time.time()
pattern = re.compile(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b')
for _ in range(100):
    pattern.findall(text)
print(f"Time with pre-compilation: {time.time() - start_time:.4f} seconds")
- Avoid using overly complex regular expressions:
import re

# A hard-to-read "do everything in one pattern" password check
bad_pattern = r'^(?=.*[A-Z])(?=.*[a-z])(?=.*\d)(?=.*[@$!%*?&])[A-Za-z\d@$!%*?&]{8,}$'

# The same rules expressed as plain Python are easier to read and maintain
def check_password(password):
    if len(password) < 8:
        return False
    if not any(c.isupper() for c in password):
        return False
    if not any(c.islower() for c in password):
        return False
    if not any(c.isdigit() for c in password):
        return False
    if not any(c in '@$!%*?&' for c in password):
        return False
    return True
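A quick sanity check that the two approaches agree; the sample passwords are made up, and the snippet assumes bad_pattern and check_password from the block above are already defined.

import re

# assumes bad_pattern and check_password from the previous block are in scope
for pwd in ['Abcdef1!', 'short1!', 'nouppercase1!']:
    print(pwd, bool(re.match(bad_pattern, pwd)), check_password(pwd))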
Pitfalls
There are some common pitfalls to watch out for when using regular expressions:
- Handling escape characters:
import re
pattern1 = '\d+'   # happens to work, but Python warns about the unknown escape, and e.g. '\b' would silently become a backspace character
pattern2 = r'\d+'  # a raw string passes the backslash through unchanged, so prefer raw strings for patterns
- Using character sets:
import re
pattern1 = r'[a-Z]'    # invalid range: 'a' (97) comes after 'Z' (90), so re raises "bad character range"
pattern2 = r'[a-zA-Z]' # Specify upper and lower case ranges separately
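You can see the second pitfall immediately, because compiling the invalid range raises an error. A tiny sketch:

import re

try:
    re.compile(r'[a-Z]')
except re.error as e:
    print(f"Invalid pattern: {e}")        # e.g. "bad character range a-Z at position 1"

print(bool(re.match(r'[a-zA-Z]', 'x')))   # True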
Practical Applications
Finally, let's look at some regular expression patterns commonly used in actual work:
- URL validation:
import re
def is_valid_url(url):
    pattern = r'^https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&//=]*)$'
    return bool(re.match(pattern, url))
urls = [
'https://www.example.com',
'http://subdomain.example.com/path?param=value',
'not_a_url',
'ftp://invalid.com'
]
for url in urls:
    print(f"{url} is a valid URL: {is_valid_url(url)}")
- Extracting Chinese characters:
import re
def extract_chinese(text):
    pattern = r'[\u4e00-\u9fa5]+'
    return re.findall(pattern, text)
text = "Hello世界!Python编程很有趣123"
chinese_chars = extract_chinese(text)
print(f"Extracted Chinese characters: {chinese_chars}")
- Date formatting:
import re
def format_date(date_string):
    # accept '-', '/', or no separator between year, month and day
    pattern = r'(\d{4})[-/]?(\d{2})[-/]?(\d{2})'
    match = re.match(pattern, date_string)
    if match:
        year, month, day = match.groups()
        return f"{year}年{month}月{day}日"
    return "Invalid date format"
dates = ['20240115', '2024-01-15', '2024/01/15']
for date in dates:
    print(f"{date} formatted: {format_date(date)}")
Summary
Regular expressions are a powerful tool, and mastering them takes time and practice. I suggest starting with simple patterns and gradually increasing complexity. In practical applications, you'll find that regular expressions can greatly simplify text processing work.
Remember, writing a good regular expression isn't just about implementing functionality, but also about readability and performance. Appropriate comments and documentation can help other developers (including your future self) better understand your code.
What do you think is the hardest part of regular expressions to master? Feel free to share your experiences and questions in the comments. Next time we can explore more advanced applications, such as backreferences and lookaround assertions.