2024-11-04

Introduction to Python Regular Expressions: Master Essential Text Processing Skills from Scratch

Introduction

Do you often need to process text data in various formats? Are you troubled by tedious string processing logic? Today I want to share with you a powerful text processing tool - regular expressions. As a Python programmer, mastering regular expressions is almost essential. Let's begin this learning journey together.

Understanding

I remember how I felt when I first started learning regular expressions: those strings of special characters looked like secret codes, and they were truly intimidating. However, as I studied and practiced more, I gradually discovered that regular expressions are actually a very elegant and practical tool.

A regular expression is essentially a small language for describing patterns in strings. You can think of it as an "intelligent text searcher". For example, if you want to find all phone numbers in an article, ordinary string methods would require a lot of conditional logic, while a regular expression can often do the job with a single pattern.

Foundation

Before learning specific syntax, let's understand the most basic concept in regular expressions - metacharacters. These special characters are like the "building blocks" of regular expressions. By combining them, we can construct various complex matching patterns.

Let's look at some of the most commonly used metacharacters:

  • . matches any single character (except newline)
  • ^ matches the start of the string
  • $ matches the end of the string
  • * matches the previous pattern zero or more times
  • + matches the previous pattern one or more times
  • ? matches the previous pattern zero or one time
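To make the list above concrete, here is a minimal sketch that exercises each metacharacter once. The sample strings and variable names are made up for illustration.

```python
import re

# Each line exercises one metacharacter on a small, made-up sample string.
dot_matches = re.findall(r'c.t', 'cat cot cut ct')        # . : any single character
start_matches = re.findall(r'^Hello', 'Hello, Hello')     # ^ : only at the start
end_matches = re.findall(r'world$', 'world, world')       # $ : only at the end
star_matches = re.findall(r'ab*', 'a ab abbb')            # * : zero or more b
plus_matches = re.findall(r'ab+', 'a ab abbb')            # + : one or more b
optional_matches = re.findall(r'colou?r', 'color colour') # ? : optional u

for name, result in [('.', dot_matches), ('^', start_matches), ('$', end_matches),
                     ('*', star_matches), ('+', plus_matches), ('?', optional_matches)]:
    print(f"{name} -> {result}")
```

Comparing the * and + lines is a quick way to see the difference between "zero or more" and "one or more": only * accepts the lone a.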

You might ask: these symbols look abstract, so how do we memorize them? My suggestion is not to memorize them by rote, but to understand them by applying them to practical cases.

Practice

Let's see how regular expressions work through some practical examples.

First, let's look at a simple example - matching email addresses:

import re

# Placeholder addresses for illustration
text = "My email is xiaoming@example.com, work email is xiaoming.wang@mail.example.org"
pattern = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'

emails = re.findall(pattern, text)
print(f"Found email addresses: {emails}")

Want to know how this regular expression works? Let's break it down:

  • \b marks a word boundary
  • [A-Za-z0-9._%+-]+ matches the local part (the name before the @)
  • @ matches the @ symbol
  • [A-Za-z0-9.-]+ matches the domain name
  • \. matches a literal dot
  • [A-Za-z]{2,} matches the top-level domain (two or more letters)
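One way to keep a breakdown like this next to the pattern itself is the re.VERBOSE flag, which lets you spread a pattern over several lines and comment each part. The sketch below is only an illustration: the name EMAIL_PATTERN and the sample addresses are my own assumptions, and the pattern is a simplified matcher, not a full validator for every legal email address.

```python
import re

# Same idea as the one-line pattern, written with re.VERBOSE so each part
# carries its own comment. Whitespace inside the pattern is ignored.
EMAIL_PATTERN = re.compile(r'''
    \b
    [A-Za-z0-9._%+-]+   # local part (before the @)
    @                   # the @ symbol
    [A-Za-z0-9.-]+      # domain name
    \.                  # literal dot
    [A-Za-z]{2,}        # top-level domain, two or more letters
    \b
''', re.VERBOSE)

# Hypothetical sample text with placeholder addresses
sample = "Contact alice@example.com or bob.smith@mail.example.org for details"
found = EMAIL_PATTERN.findall(sample)
print(found)
```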

Let's look at another more practical example - extracting Chinese mobile phone numbers:

import re

text = """
Xiao Ming's phone is 13912345678
Xiao Hong's number is +86 139-1234-5678
Xiao Zhang's phone is 15987654321, backup number is 13812345678
"""

pattern = r'1[3-9]\d{9}'
phone_numbers = re.findall(pattern, text)
print(f"Found phone numbers: {phone_numbers}")

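This regular expression means:

  • 1 matches the first digit, which is always 1
  • [3-9] matches the second digit, 3 through 9
  • \d{9} matches the remaining 9 digits

Note that the hyphenated number on the second line (+86 139-1234-5678) is not found, because the pattern expects 11 consecutive digits.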
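The simple pattern above has two limits: it skips numbers written with separators, and it can match 11 digits in the middle of a longer digit string. Here is a rough sketch of one way to handle both. It assumes separators are single hyphens or spaces in a 3-4-4 grouping, and the "Order ID" line is invented to show the second problem.

```python
import re

text = """
Xiao Zhang's phone is 15987654321
Xiao Hong's number is +86 139-1234-5678
Order ID 913912345678901 is not a phone number
"""

# (?<!\d) and (?!\d) make sure the match is not part of a longer digit run.
# [- ]? allows an optional hyphen or space between the 3-4-4 groups (an assumption).
PHONE = re.compile(r'(?<!\d)1[3-9]\d[- ]?\d{4}[- ]?\d{4}(?!\d)')

# Strip the separators so every number comes out as 11 plain digits.
numbers = [re.sub(r'[- ]', '', m) for m in PHONE.findall(text)]
print(numbers)
```

Real-world data will have more variations (parentheses, dots, country codes without a space), so treat the separator rule as a starting point.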

Advanced

After mastering the basics, let's look at some more advanced applications.

Group Matching

Sometimes we not only need to match text but also extract specific parts. This is where grouping comes in:

import re

log = "2024-01-15 10:30:45 [ERROR] Failed to connect to database"
pattern = r'(\d{4}-\d{2}-\d{2}) (\d{2}:\d{2}:\d{2}) \[(\w+)\] (.*)'

match = re.match(pattern, log)
if match:
    date, time, level, message = match.groups()
    print(f"Date: {date}")
    print(f"Time: {time}")
    print(f"Level: {level}")
    print(f"Message: {message}")
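When a pattern has several groups, counting positions gets error-prone. Named groups, written (?P<name>...), are one alternative. The sketch below applies them to the same log line; the group names and the LOG_PATTERN constant are my own choices.

```python
import re

# The same log pattern, with each group given a name via (?P<name>...).
LOG_PATTERN = re.compile(
    r'(?P<date>\d{4}-\d{2}-\d{2}) '
    r'(?P<time>\d{2}:\d{2}:\d{2}) '
    r'\[(?P<level>\w+)\] '
    r'(?P<message>.*)'
)

log = "2024-01-15 10:30:45 [ERROR] Failed to connect to database"
match = LOG_PATTERN.match(log)

# groupdict() returns a dict keyed by group name; fall back to {} if no match.
fields = match.groupdict() if match else {}
print(fields)
```

With names in place, reordering or adding a group does not break code that reads fields['level'].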

Greedy vs Non-Greedy Matching

This is a very important concept in regular expressions. Look at this example:

import re

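text = "<div>First part</div><div>Second part</div>"

# Greedy: .* runs to the last </div>, so the whole string is one match
pattern1 = r'<div>.*</div>'
print("Greedy matching result:", re.findall(pattern1, text))

# Non-greedy: .*? stops at the first </div>, giving two matches
pattern2 = r'<div>.*?</div>'
print("Non-greedy matching result:", re.findall(pattern2, text))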

You'll find that greedy matching consumes as many characters as possible while still letting the overall pattern succeed, while non-greedy matching consumes as few as possible. Non-greedy matching is often more useful when dealing with markup languages like HTML.
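If you only want the text between the tags, you can put the quantifier inside a capture group. The sketch below compares three variants on the same string; the variable names are mine, and the negated character class [^<]* is an alternative I am adding, which works here because the inner text contains no < character.

```python
import re

text = "<div>First part</div><div>Second part</div>"

# With one capture group, findall returns only the captured text.
greedy = re.findall(r'<div>(.*)</div>', text)      # runs to the last </div>
lazy = re.findall(r'<div>(.*?)</div>', text)       # stops at the first </div>
negated = re.findall(r'<div>([^<]*)</div>', text)  # cannot cross a '<' at all

print("greedy: ", greedy)
print("lazy:   ", lazy)
print("negated:", negated)
```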

Optimization

Here are some performance optimization tips when using regular expressions:

  1. Use re.compile() to pre-compile regular expressions that you reuse:
import re
import time

# Placeholder address repeated to build a large input
text = "user@example.com " * 10000

# Pattern passed as a string on every call.
# Note: re caches recently compiled patterns, so the gap is usually small.
start_time = time.perf_counter()
pattern = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'
for _ in range(100):
    re.findall(pattern, text)
print(f"Time without pre-compilation: {time.perf_counter() - start_time:.4f} seconds")

# Pattern compiled once and reused
start_time = time.perf_counter()
pattern = re.compile(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b')
for _ in range(100):
    pattern.findall(text)
print(f"Time with pre-compilation: {time.perf_counter() - start_time:.4f} seconds")
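  2. Avoid using overly complex regular expressions:
import re

# Hard to read and maintain: every rule is packed into one pattern
bad_pattern = r'^(?=.*[A-Z])(?=.*[a-z])(?=.*\d)(?=.*[@$!%*?&])[A-Za-z\d@$!%*?&]{8,}$'

# Easier to follow: check each rule in plain Python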
def check_password(password):
    if len(password) < 8:
        return False
    if not any(c.isupper() for c in password):
        return False
    if not any(c.islower() for c in password):
        return False
    if not any(c.isdigit() for c in password):
        return False
    if not any(c in '@$!%*?&' for c in password):
        return False
    return True
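Before swapping one approach for the other, it is worth checking that they accept the same inputs. The sketch below runs the regex and the check_password function above on three made-up passwords. It shows one difference: the regex only allows letters, digits, and the listed symbols, while the function accepts any extra character as long as the four rules are met.

```python
import re

PASSWORD_RE = re.compile(
    r'^(?=.*[A-Z])(?=.*[a-z])(?=.*\d)(?=.*[@$!%*?&])[A-Za-z\d@$!%*?&]{8,}$'
)

def check_password(password):
    # Same rules as the step-by-step function in the article, written compactly.
    return (
        len(password) >= 8
        and any(c.isupper() for c in password)
        and any(c.islower() for c in password)
        and any(c.isdigit() for c in password)
        and any(c in '@$!%*?&' for c in password)
    )

# Made-up sample passwords: valid, missing uppercase, contains '#'
samples = ['Abcdef1!', 'abcdef1!', 'Abcdef1!#']
results = {p: (bool(PASSWORD_RE.match(p)), check_password(p)) for p in samples}

for password, (by_regex, by_function) in results.items():
    print(f"{password!r}: regex={by_regex}, function={by_function}")
```

Which behavior you want depends on your password policy. If other characters should be rejected, the function needs one more check.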

Pitfalls

There are some common pitfalls to watch out for when using regular expressions:

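  1. Handling escape characters:
import re

# Without the r prefix, Python processes backslash escapes before re sees the pattern.
# '\d' is not a valid string escape and triggers a warning in recent Python versions.
pattern1 = '\d+'

# A raw string passes the backslash through to the regex engine unchanged
pattern2 = r'\d+'
  2. Using character sets:
import re

# Invalid: 'a' (code 97) comes after 'Z' (code 90), so compiling this raises re.error
pattern1 = r'[a-Z]'

# Specify the lowercase and uppercase ranges separately
pattern2 = r'[a-zA-Z]'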
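Here is a small sketch that makes both pitfalls above visible at runtime. For the escape problem I use \b instead of \d, because in a normal string '\b' silently becomes a backspace character, so the difference shows up in the results. The sample sentence and variable names are made up.

```python
import re

sentence = 'a cat sat'

# In a normal string, '\b' is the backspace character, not a word boundary,
# so this pattern looks for backspace + "cat" + backspace and finds nothing.
plain_result = re.findall('\bcat\b', sentence)

# In a raw string, \b reaches the regex engine and means "word boundary".
raw_result = re.findall(r'\bcat\b', sentence)

# An invalid character range is reported when the pattern is compiled.
try:
    re.compile(r'[a-Z]')
    range_error = None
except re.error as exc:
    range_error = str(exc)

print("plain string:", plain_result)
print("raw string:  ", raw_result)
print("range error: ", range_error)
```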

Practical Applications

Finally, let's look at some regular expression patterns commonly used in actual work:

  1. URL validation:
import re

def is_valid_url(url):
    pattern = r'^https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&//=]*)$'
    return bool(re.match(pattern, url))


urls = [
    'https://www.example.com',
    'http://subdomain.example.com/path?param=value',
    'not_a_url',
    'ftp://invalid.com'
]

for url in urls:
    print(f"{url} is valid URL: {is_valid_url(url)}")
  2. Extracting Chinese characters:
import re

def extract_chinese(text):
    pattern = r'[\u4e00-\u9fa5]+'
    return re.findall(pattern, text)


text = "Hello世界!Python编程很有趣123"
chinese_chars = extract_chinese(text)
print(f"Extracted Chinese characters: {chinese_chars}")
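  3. Date formatting:
import re

def format_date(date_string):
    # Accepts YYYYMMDD or YYYY-MM-DD; fullmatch rejects trailing characters
    pattern = r'(\d{4})-?(\d{2})-?(\d{2})'
    match = re.fullmatch(pattern, date_string)
    if match:
        year, month, day = match.groups()
        # Chinese date format: 年 (year), 月 (month), 日 (day)
        return f"{year}年{month}月{day}日"
    return "Invalid date format"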


dates = ['20240115', '2024-01-15', '2024/01/15']
for date in dates:
    print(f"{date} formatted: {format_date(date)}")
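One thing the date pattern cannot do is tell whether the date exists: 2024-02-30 has the right shape but is not on the calendar. A possible approach, sketched below, is to let the regex check the shape and datetime.strptime check the calendar. The parse_date helper and its return convention (None on failure) are my own assumptions.

```python
import re
from datetime import datetime

DATE_RE = re.compile(r'(\d{4})-?(\d{2})-?(\d{2})')

def parse_date(date_string):
    """Return a date object if the string is well-formed and a real date, else None."""
    match = DATE_RE.fullmatch(date_string)
    if not match:
        return None
    try:
        # strptime raises ValueError for impossible dates such as February 30
        return datetime.strptime(''.join(match.groups()), '%Y%m%d').date()
    except ValueError:
        return None

for candidate in ['20240115', '2024-01-15', '2024/01/15', '2024-02-30']:
    print(f"{candidate} -> {parse_date(candidate)}")
```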

Summary

Regular expressions are a powerful tool, and mastering them takes time and practice. I suggest starting with simple patterns and gradually increasing complexity. In practical applications, you'll find that regular expressions can greatly simplify text processing work.

Remember, writing a good regular expression isn't just about implementing functionality, but also about readability and performance. Appropriate comments and documentation can help other developers (including your future self) better understand your code.

What do you think is the most difficult part of regular expressions to master? Feel free to share your experiences and questions in the comments. Next time we can explore more advanced regular expression applications, such as backreferences and lookaround assertions.
