Regular Expressions


What is RegEx?

RegEx stands for Regular Expressions. It’s a tool used to define search patterns within text. These patterns help you find, match, extract, or even validate specific sequences of characters—such as email addresses, phone numbers, or keywords—in a fast and flexible way.

Think of RegEx as a mini-language made up of special symbols and rules that let you describe exactly what kind of text you’re looking for.

Where Does RegEx Come From?

To truly understand RegEx at its core, we’d need to explore concepts from Automata Theory, a part of the Theory of Computation (TOC). This field introduces foundational ideas such as:

  • Alphabets: The set of symbols you work with (like letters, numbers, or characters).
  • Languages: Collections of strings formed using these alphabets.
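  • Finite Automata: Simple machines that read input one symbol at a time and either accept or reject it. The set of languages they accept is exactly the set that regular expressions can describe, which makes them the theoretical roots of RegEx.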

But don’t worry—you don’t need a deep computer science background to start using RegEx effectively.

Why Learn RegEx?

Whether you’re a developer, a data analyst, a hacker, or just someone who works with a lot of text, RegEx can save you hours of manual searching and editing. It’s used in programming languages, in the find-and-replace features of text editors, and in command-line tools such as grep and sed.

Getting Started

Regular expressions don’t work the exact same way on every platform. There isn’t one perfect version that fits all. Different tools and languages—like Python, JavaScript, .NET, or POSIX—have their own rules and features. Some advanced parts, like lookbehinds or named groups, may work differently or not at all in some systems.

However, the basic symbols, like . for any character, * for zero or more repetitions, and +, ?, and [], behave the same almost everywhere. These symbols are presented in tabular form below.

Basic Matching

S.No  Pattern  Function
1     .        Matches any single character except a newline
2     ^        Matches the start of a line
3     $        Matches the end of a line
4     [ ]      Matches any character inside the brackets
5     [^ ]     Matches any character not inside the brackets
6     |        Logical OR
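
The symbols in this table can be tried out with a short script. The following is a minimal sketch using Python's built-in re module, chosen here only for illustration; the sample text and variable names are made up for this example. Note that in Python, ^ and $ match at every line only when the re.MULTILINE flag is set; without it they anchor to the whole string.

```python
import re

# Illustrative sample text with three lines (not from the article).
text = "cat\ncot\ncut dog"

# . matches any single character except a newline
dot_matches = re.findall(r"c.t", text)                 # ['cat', 'cot', 'cut']

# ^ and $ anchor to the start/end of each line when re.MULTILINE is set
line_starts = re.findall(r"^c", text, re.MULTILINE)    # ['c', 'c', 'c']
line_ends = re.findall(r"t$", text, re.MULTILINE)      # ['t', 't']

# [ ] matches one character from the set, [^ ] one character not in it
in_set = re.findall(r"c[ao]t", text)                   # ['cat', 'cot']
not_in_set = re.findall(r"c[^ao]t", text)              # ['cut']

# | is a logical OR between alternatives
either = re.findall(r"cat|dog", text)                  # ['cat', 'dog']

print(dot_matches, line_starts, line_ends, in_set, not_in_set, either)
```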

Quantifiers

Each quantifier applies to the element that immediately precedes it.

S.No  Pattern  Function
1     *        Zero or more occurrences
2     +        One or more occurrences
3     ?        Zero or one occurrence
4     {n}      Matches exactly n occurrences
5     {n,}     Matches n or more occurrences
6     {n,m}    Matches between n and m occurrences
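
To see how each quantifier changes what is accepted, the sketch below applies them to the letter a and checks a handful of sample strings. It uses Python's re module as an illustrative choice; the helper full_matches and the sample list are hypothetical names introduced for this example.

```python
import re

# Sample strings: zero to four repetitions of "a".
samples = ["", "a", "aa", "aaa", "aaaa"]

def full_matches(pattern):
    """Return the samples that the pattern matches from start to end."""
    return [s for s in samples if re.fullmatch(pattern, s)]

star = full_matches(r"a*")            # ['', 'a', 'aa', 'aaa', 'aaaa']
plus = full_matches(r"a+")            # ['a', 'aa', 'aaa', 'aaaa']
optional = full_matches(r"a?")        # ['', 'a']
exactly_two = full_matches(r"a{2}")   # ['aa']
two_or_more = full_matches(r"a{2,}")  # ['aa', 'aaa', 'aaaa']
two_to_three = full_matches(r"a{2,3}")  # ['aa', 'aaa']

for name, result in [("a*", star), ("a+", plus), ("a?", optional),
                     ("a{2}", exactly_two), ("a{2,}", two_or_more),
                     ("a{2,3}", two_to_three)]:
    print(name, result)
```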

Metacharacters

S.No  Pattern  Function
1     \        Escapes a special character
2     ( )      Groups patterns together and captures the matched text
               Eg: (ab)+ matches “ab”, “abab”, “ababab” and so on
3     \w       Matches any word character (alphanumeric or underscore)
4     \d       Matches any digit character
5     \s       Matches any whitespace character
6     \b       Matches a word boundary
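
The metacharacters above can be explored one at a time. This is a minimal sketch in Python's re module (an illustrative choice); all sample strings and variable names are assumptions made for the example.

```python
import re

# \ escapes a special character: \. is a literal dot, not "any character"
literal_dot = re.findall(r"3\.5", "3.5 and 3x5")    # ['3.5']
any_char = re.findall(r"3.5", "3.5 and 3x5")        # ['3.5', '3x5']

# ( ) groups a sub-pattern so a quantifier applies to the whole group
repeated = [s for s in ["ab", "abab", "ababab", "aba"]
            if re.fullmatch(r"(ab)+", s)]           # ['ab', 'abab', 'ababab']

# ( ) also captures the text each group matched
captured = re.search(r"(\d+)-(\d+)", "pages 10-20").groups()  # ('10', '20')

# \w, \d and \s are shorthand character classes
words = re.findall(r"\w+", "user_1 paid 42 USD")    # ['user_1', 'paid', '42', 'USD']
digits = re.findall(r"\d+", "user_1 paid 42 USD")   # ['1', '42']
pieces = re.split(r"\s+", "a  b\tc\nd")             # ['a', 'b', 'c', 'd']

# \b matches the position between a word character and a non-word character
whole_word = re.findall(r"\bcat\b", "cat category concat")    # ['cat']

print(literal_dot, any_char, repeated, captured, words, digits, pieces, whole_word)
```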

Examples

Pattern to match a string that starts with “h”

Condition 1: The string must start with the letter h. We will use the caret ^ to anchor the match at the start of the string, giving us the pattern ^h.
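Condition 2: The rest of the string can be of any length. We will match zero or more word characters (letters, digits, or underscores) after h using the pattern \w*. To allow any character at all, including spaces and punctuation, use .* instead.

Combining the two conditions, we get ^h\w* as the final pattern.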
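
A quick way to check this pattern is to run it against a few strings. The sketch below uses Python's re module for illustration; the helper first_match and the candidate strings are hypothetical. Because \w does not match a space, the match on "hi there" stops after "hi", and because matching is case-sensitive by default, "Hello" is not matched.

```python
import re

pattern = re.compile(r"^h\w*")

def first_match(text):
    """Return the matched text, or None when the pattern does not match."""
    m = pattern.search(text)
    return m.group() if m else None

candidates = ["hello", "h", "hi there", "oh hello", "Hello"]
results = {s: first_match(s) for s in candidates}

# Expected: hello -> 'hello', h -> 'h', hi there -> 'hi',
#           oh hello -> None, Hello -> None
for text, matched in results.items():
    print(repr(text), "->", repr(matched))
```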

Note:

When it comes to regular expressions, there is rarely a single correct answer, and several patterns can produce the same output. Another valid answer here is ^h\w+. Note that ^h\w* and ^h\w+ are not equivalent: \w+ requires at least one word character after h, so it does not match the single-letter string h. Both are acceptable because the question does not say whether h on its own should count.
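
The difference between the two patterns shows up on the single-character string "h". A minimal sketch in Python's re module (chosen for illustration, with made-up variable names):

```python
import re

star = re.compile(r"^h\w*")   # h followed by zero or more word characters
plus = re.compile(r"^h\w+")   # h followed by at least one word character

# Both patterns agree on a longer word...
star_on_hi = star.search("hi") is not None    # True
plus_on_hi = plus.search("hi") is not None    # True

# ...but only the * version accepts "h" by itself.
star_on_h = star.search("h") is not None      # True
plus_on_h = plus.search("h") is not None      # False

print(star_on_hi, plus_on_hi, star_on_h, plus_on_h)
```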

Pattern to match all 3-digit numbers

Condition 1: The string must contain only digits. We will use \d to match any digit.
Condition 2: The string has exactly 3 digits. We will use {3} to represent exactly 3 occurrences.

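Combining the two conditions, we get \d{3} as the final pattern. On its own, this pattern also matches the first three digits of a longer number such as 12345. To require that the whole string is exactly three digits, anchor it as ^\d{3}$; to find standalone 3-digit numbers inside longer text, use \b\d{3}\b.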
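
The effect of anchoring on \d{3} can be seen by comparing three variants on the same input. This is a sketch using Python's re module as an illustrative choice; the sample text and variable names are invented for the example.

```python
import re

text = "Call 911 or 5551234, room 42, code 007."

# Unanchored: also picks three-digit chunks out of the longer number.
unanchored = re.findall(r"\d{3}", text)        # ['911', '555', '123', '007']

# Word boundaries: only numbers that are exactly three digits long.
standalone = re.findall(r"\b\d{3}\b", text)    # ['911', '007']

# Whole-string check: fullmatch behaves like wrapping the pattern in ^...$
whole = [s for s in ["123", "12", "1234", "abc"]
         if re.fullmatch(r"\d{3}", s)]         # ['123']

print(unanchored, standalone, whole)
```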

Pattern to match a valid email ID

Condition 1: The string must have a username part at the beginning. We will use the pattern \w+ for this.
Condition 2: An @ symbol follows the username. In this case, we can use the character itself as the pattern.
Condition 3: A domain name follows the @. We can again use \w+ for this.
Condition 4: A dot ( . ) follows the domain name. A bare dot has a special meaning in regular expressions (it matches any character), so escape it with a backslash to match a literal dot: \.
Condition 5: A TLD (Top Level Domain) follows the dot. We can once again use \w+ for this.

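Combining the 5 conditions, we get \w+@\w+\.\w+ as the final pattern. This is a deliberately simplified pattern: \w does not match dots or hyphens, so addresses such as john.doe@example.com or user@my-site.com are not matched in full.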
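
The sketch below runs this email pattern against a few made-up addresses to show what it accepts and where it falls short. It uses Python's re module for illustration; the names simple_email, addresses, accepted, and partial are assumptions introduced here, and the pattern is a teaching example rather than a complete email validator.

```python
import re

simple_email = re.compile(r"\w+@\w+\.\w+")

addresses = [
    "alice@example.com",
    "bob_99@mail.org",
    "no-at-sign.com",          # no @ symbol
    "john.doe@example.com",    # dot in the username is not matched by \w
    "a@b",                     # missing the dot and TLD
]

# fullmatch requires the whole string to fit the pattern.
accepted = [a for a in addresses if simple_email.fullmatch(a)]
# ['alice@example.com', 'bob_99@mail.org']

# search finds the pattern anywhere, so it can return only part of an address.
partial = simple_email.search("john.doe@example.com").group()
# 'doe@example.com'

print(accepted)
print(partial)
```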