The Art of Conquering Regular Expressions: A Journey from Beginner to Python Expert
Release time: 2024-12-03 13:52:33
Copyright Statement: This article is an original work of the website and follows the CC 4.0 BY-SA copyright agreement. Please include the original source link and this statement when reprinting.

Article link: https://haoduanwen.com/en/content/aid/2338?s=en%2Fcontent%2Faid%2F2338

Opening Thoughts

Have you ever encountered these frustrations: wanting to extract all phone numbers from a large text, or validating whether a user's input email address is legitimate? Using ordinary string processing methods, the code might become messy and error-prone. In fact, all these problems can be elegantly solved using regular expressions.

Today, I want to share my insights from learning and using regular expressions. As a Python developer, I deeply appreciate the power of regular expressions. It's like a Swiss Army knife that can easily handle various text processing challenges.

The Regular Expression Code

Regular expressions may look like a mysterious combination of symbols, but they actually have clear meanings. Let's uncover the mysteries behind these symbols together.

First, we need to import Python's re module:

import re

In regular expressions, each symbol is like a unique code. For example, the dot (.) matches any character, while the asterisk (*) means repeat zero or more times. These symbols, when combined, can describe various complex text patterns.
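A quick sketch of how these two symbols behave, alone and combined (the sample strings are made up for illustration):

```python
import re

# "." matches any single character except a newline
print(re.findall(r"c.t", "cat cot cut"))     # ['cat', 'cot', 'cut']

# "*" repeats the preceding element zero or more times
print(re.findall(r"ab*", "a ab abb"))        # ['a', 'ab', 'abb']

# Combined, ".*" matches any run of characters
print(re.findall(r'".*"', '"hello" world'))  # ['"hello"']
```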

I remember when I first started learning regular expressions, I was always confused by these symbols. Until one day, I suddenly thought of comparing them to Lego blocks - each symbol is a basic block, and through different combinations, we can build powerful patterns.

Practical Techniques

Let's master the use of regular expressions through some practical examples.

Basic Matching

text = "Python programming is fun, Python fascinates me"
pattern = r"Python"
matches = re.findall(pattern, text)
print(matches)  # Output: ['Python', 'Python']

This example looks simple, but it demonstrates the most basic usage of regular expressions: the findall() function returns every match in the text. Simple matching like this comes up constantly in real work, for example when counting how many times a keyword appears in an article.
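That keyword-counting case is just the length of the result list (the article text here is invented for illustration):

```python
import re

article = "Python is popular. Many beginners choose Python, and Python keeps growing."
count = len(re.findall(r"Python", article))
print(f"'Python' appears {count} times")  # 'Python' appears 3 times
```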

Phone Number Matching

text = "My mobile number is 13812345678 and my home phone is 0101234567"
pattern = r"1[3-9]\d{9}|0\d{9,10}"
matches = re.findall(pattern, text)
print(matches)  # Output: ['13812345678', '0101234567']

This example shows how to match Chinese mobile and landline numbers. In the pattern, 1[3-9] means starting with 1 followed by a digit 3-9, \d{9} means 9 digits following. The vertical bar (|) means OR, used to match different number formats simultaneously.
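One caveat with this pattern: embedded in a longer digit run, the mobile alternative can still match a partial number. A common hardening (a sketch, using lookarounds I've added to require that the match is not surrounded by other digits):

```python
import re

# (?<!\d) and (?!\d) reject matches embedded inside longer digit runs
pattern = r"(?<!\d)(?:1[3-9]\d{9}|0\d{9,10})(?!\d)"
text = "order 9913812345678999 is not a phone; 13812345678 is"
print(re.findall(pattern, text))  # ['13812345678']
```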

Practical Cases

Let's look at some scenarios commonly encountered in actual development.

Email Address Validation

def is_valid_email(email):
    pattern = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
    return bool(re.match(pattern, email))


test_emails = [
    "user@example.com",
    "invalid.email@com",
    "first.last@example.org"
]

for email in test_emails:
    print(f"{email}: {'valid' if is_valid_email(email) else 'invalid'}")

This email validation regular expression looks complex, but it becomes clear when broken into parts:

- ^ anchors the start
- [a-zA-Z0-9._%+-]+ matches the username part
- @ matches the literal @ symbol
- [a-zA-Z0-9.-]+ matches the domain name part
- \. matches the dot
- [a-zA-Z]{2,} matches the top-level domain
- $ anchors the end
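That same breakdown can be written directly into the pattern using the re.VERBOSE flag, which lets the pattern contain whitespace and comments (a stylistic variant of the pattern above, not a behavioral change):

```python
import re

EMAIL_PATTERN = re.compile(r"""
    ^[a-zA-Z0-9._%+-]+   # username part
    @                    # literal @ symbol
    [a-zA-Z0-9.-]+       # domain name part
    \.                   # literal dot
    [a-zA-Z]{2,}$        # top-level domain
""", re.VERBOSE)

print(bool(EMAIL_PATTERN.match("user@example.com")))   # True
print(bool(EMAIL_PATTERN.match("invalid.email@com")))  # False
```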

Data Cleaning

def clean_text(text):
    # Remove excess whitespace
    text = re.sub(r'\s+', ' ', text)
    # Remove special characters
    text = re.sub(r'[^\w\s]', '', text)
    # Convert to lowercase
    return text.lower().strip()


messy_text = """
    This is a   messy piece of text!!!
    with extra spaces   and special symbols@#¥%……&*
"""
print(clean_text(messy_text))

This example shows how to use regular expressions to clean text data. Such preprocessing steps are very important in data analysis and natural language processing. I often use this method to clean user input data or web crawler content.
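One detail worth knowing: in Python 3, \w is Unicode-aware by default, so a cleaning function like the one above keeps Chinese characters while stripping punctuation. A quick check (redefining the function here so the snippet is self-contained):

```python
import re

def clean_text(text):
    text = re.sub(r'\s+', ' ', text)      # collapse whitespace
    text = re.sub(r'[^\w\s]', '', text)   # drop non-word, non-space characters
    return text.lower().strip()

# \w matches Unicode word characters, so the Chinese text survives
print(clean_text("Hello!!!   世界 @#%"))  # hello 世界
```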

Advanced Techniques

As your understanding of regular expressions deepens, you'll discover some advanced uses that can make your code more elegant and efficient.

Named Capture Groups

def parse_name(full_name):
    pattern = r'(?P<last>[a-zA-Z]+),\s*(?P<first>[a-zA-Z]+)'
    match = re.match(pattern, full_name)
    if match:
        return match.groupdict()
    return None


name = "Smith, John"
result = parse_name(name)
print(result)  # Output: {'last': 'Smith', 'first': 'John'}

Named capture groups allow us to give meaningful names to groups in regular expressions. This is particularly useful when dealing with complex text structures, like parsing log files or configuration files.
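As a sketch of that log-parsing use case (the log format here is invented for the example):

```python
import re

# Each named group labels one field of the (hypothetical) log line format
LOG_PATTERN = re.compile(
    r'(?P<level>[A-Z]+)\s+(?P<time>\d{2}:\d{2}:\d{2})\s+(?P<message>.*)'
)

line = "ERROR 12:30:45 connection refused"
match = LOG_PATTERN.match(line)
if match:
    print(match.groupdict())
    # {'level': 'ERROR', 'time': '12:30:45', 'message': 'connection refused'}
```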

Non-Greedy Matching

text = "<p>First paragraph</p><p>Second paragraph</p>"

greedy_pattern = r'<p>.*</p>'

non_greedy_pattern = r'<p>.*?</p>'

print(re.findall(greedy_pattern, text))     # Output: ['<p>First paragraph</p><p>Second paragraph</p>']
print(re.findall(non_greedy_pattern, text)) # Output: ['<p>First paragraph</p>', '<p>Second paragraph</p>']

Non-greedy matching is particularly useful when dealing with markup languages like HTML or XML. By adding a question mark (?) after quantifiers, we can achieve minimum matching instead of the default greedy matching.
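Combined with a capture group, non-greedy matching can extract just the inner text of each tag (a sketch that works for simple, non-nested markup; a real HTML parser is safer for anything more complex):

```python
import re

text = "<p>First paragraph</p><p>Second paragraph</p>"
# The group (.*?) captures only what sits between the tags
inner = re.findall(r'<p>(.*?)</p>', text)
print(inner)  # ['First paragraph', 'Second paragraph']
```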

Performance Optimization

When using regular expressions, performance is also an important factor to consider. Here are some optimization suggestions:

import time


def find_without_compile(text, iterations=1000):
    pattern = r'\b\w+@\w+\.\w+\b'
    start = time.time()
    for _ in range(iterations):
        re.findall(pattern, text)
    return time.time() - start


def find_with_compile(text, iterations=1000):
    pattern = re.compile(r'\b\w+@\w+\.\w+\b')
    start = time.time()
    for _ in range(iterations):
        pattern.findall(text)
    return time.time() - start


test_text = "Contact info: alice@example.com and bob@test.org"
time1 = find_without_compile(test_text)
time2 = find_with_compile(test_text)
print(f"Uncompiled time: {time1:.4f} seconds")
print(f"Pre-compiled time: {time2:.4f} seconds")

From my experience, pre-compiling is a worthwhile optimization whenever the same regular expression is used many times. (Note that the re module also caches recently compiled patterns internally, so in simple scripts the measured gain mainly comes from skipping that cache lookup on every call.)
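In practice this usually means compiling once at module level and reusing the pattern object everywhere (a minimal sketch; the sample addresses are placeholders):

```python
import re

# Compile once at import time, reuse in every call
EMAIL_RE = re.compile(r'\b\w+@\w+\.\w+\b')

def extract_emails(text):
    return EMAIL_RE.findall(text)

print(extract_emails("Contact: support@example.com and info@test.org"))
# ['support@example.com', 'info@test.org']
```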

Conclusion

Regular expressions are like a miniature programming language, and mastering them requires time and practice. But once you understand their core concepts, you'll find they're an extremely powerful tool.

Did you know? Regular expressions can be traced back to 1951, when mathematician Stephen Cole Kleene introduced the concept of regular languages while studying neural networks. Today, they have become an essential skill for every programmer.

Remember, the best way to learn regular expressions is through practice. You can start with simple patterns and gradually increase complexity. When applying this knowledge in actual projects, you'll find that regular expressions can help you solve many seemingly complex problems.

Do you now have a new understanding of regular expressions? Feel free to share your thoughts and experiences in the comments. If you have any questions, you can also leave a message to discuss. Let's explore more possibilities together in this magical world of regular expressions.
