Week 9 Lecture: Advanced String Processing

This week, we focus on working with strings in Python. Strings are one of the most common data types in programming, and Python provides many built-in tools to work with them. You’ll learn how to treat strings as sequences, use string methods to clean and format data, and solve real-world problems.

1. Strings as Immutable Sequences

A sequence is an ordered collection of items. You’ve already worked with two other sequence types: lists and tuples. In Python, a string is simply a sequence of characters. This means that each character in a string has a specific position, or index, starting from 0.

Thinking of strings as sequences allows us to use familiar operations like indexing ([]), slicing ([:]), and checking for membership (in) to inspect and extract parts of a string.

However, strings have one crucial property that makes them different from lists: they are immutable. This means that once a string is created, its contents cannot be changed. Any operation that appears to “modify” a string (like converting it to lowercase) actually creates and returns an entirely new string. This is the same behavior you saw with tuples.

Syntax Overview

Operation	Syntax	Explanation
Length	`len(my_string)`	Returns the number of characters in the string.
Indexing	`my_string[index]`	Accesses the single character at the given `index`.
Find Index	`my_string.index(substring)`	Returns the index of the first occurrence of `substring`. (See Important Notes regarding errors).
Slicing	`my_string[start:end:step]`	Extracts a new substring. `start` is inclusive, `end` is exclusive.
Membership	`substring in my_string`	Returns `True` if `substring` occurs anywhere within `my_string`, and `False` otherwise.
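
The operations in the table can be tried on a small sample string. This is a minimal sketch; the variable name `word` and its value are chosen only for illustration.

```python
# Illustrative sample string (not taken from the lecture examples).
word = "python"

length = len(word)            # 6 characters
first = word[0]               # 'p' (indexing starts at 0)
position = word.index("th")   # 2, where "th" first begins
middle = word[1:4]            # 'yth' (indices 1, 2, 3; the end is exclusive)
every_other = word[::2]       # 'pto' (step of 2)
has_py = "py" in word         # True

print(length, first, position, middle, every_other, has_py)
```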

Important Notes

Immutability Error: Trying to change a character at a specific index, like my_string[0] = 'H', will cause a TypeError. You cannot modify a string “in-place”.
ValueError from index(): When using my_string.index("x"), if “x” is not found in the string, Python raises a ValueError, and the program stops unless that error is handled. This is different from my_string.find("x"), which returns -1 when the substring is not found.
Slicing Behavior: Remember that the end index in a slice is exclusive. For example, my_string[0:3] gets the characters at indices 0, 1, and 2, but not the character at index 3.
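
The first two notes can be checked directly. The sketch below uses an illustrative string `greeting` and catches each error so the script keeps running; the variable names are assumptions made for the example.

```python
greeting = "hello"

# Item assignment is not supported on strings, so this raises TypeError.
try:
    greeting[0] = "H"
    assignment_failed = False
except TypeError:
    assignment_failed = True

# Build a new string instead of modifying the old one.
capitalized = "H" + greeting[1:]

# index() raises ValueError when the substring is missing...
try:
    greeting.index("x")
    index_failed = False
except ValueError:
    index_failed = True

# ...while find() signals "not found" by returning -1.
find_result = greeting.find("x")

print(capitalized, assignment_failed, index_failed, find_result)
```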

Code Example: Palindrome Checker

A palindrome is a word or phrase that reads the same forward and backward, ignoring case and non-alphabetic characters.

Problem Statement: Write a function is_palindrome(text) that accepts a string and returns True if it’s a palindrome and False otherwise. The check must be case-insensitive and ignore all non-letter characters.
Example Cases:
- is_palindrome("A man, a plan, a canal: Panama") -> True
- is_palindrome("race a car") -> False

Complete Code:

def is_palindrome(text):
    # Step 1: Create a new, clean string with only lowercase alphabetic characters.
    cleaned_text = ""
    for char in text:
        if 'a' <= char.lower() <= 'z':
            cleaned_text += char.lower() # Build the new string

    # Step 2: Reverse the cleaned string using a slice.
    # Slicing with a step of -1 is a powerful and concise way to reverse any sequence.
    reversed_text = cleaned_text[::-1]

    # Step 3: Compare the cleaned string to its reverse.
    return cleaned_text == reversed_text

# --- Test Cases ---
print(f"'A man, a plan, a canal: Panama' is a palindrome: {is_palindrome('A man, a plan, a canal: Panama')}")
print(f"'race a car' is a palindrome: {is_palindrome('race a car')}")

Code Breakdown:
1. Immutability in Action: Because strings are immutable, we can’t just remove the punctuation and spaces from the original text. Instead, we create an empty string cleaned_text and build it up character by character, only adding the ones we want to keep.
2. Reversing with a Slice: The slice [::-1] is a classic Python idiom for reversing a sequence. With start and end left out and a step of -1, Python starts at the last character and walks backwards to the first. This is a concise and efficient way to create a reversed copy of a string.
3. The Logic: The core logic is simple: once the text is cleaned and standardized (all lowercase, no punctuation), it’s a palindrome if and only if it is identical to its reversed version.

Expected Output:

'A man, a plan, a canal: Panama' is a palindrome: True
'race a car' is a palindrome: False
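
To see why the two test cases give different answers, it can help to print the intermediate strings. The sketch below pulls Step 1 into a helper; the name `clean_letters` is hypothetical and used only for this illustration.

```python
def clean_letters(text):
    # Same cleaning loop as Step 1 of is_palindrome.
    cleaned_text = ""
    for char in text:
        lowered = char.lower()
        if 'a' <= lowered <= 'z':
            cleaned_text += lowered
    return cleaned_text

for sample in ["A man, a plan, a canal: Panama", "race a car"]:
    cleaned = clean_letters(sample)
    # Show the cleaned string next to its reverse.
    print(cleaned, cleaned[::-1])
```

The first sample cleans to "amanaplanacanalpanama", which equals its own reverse. The second cleans to "raceacar", whose reverse is "racaecar", so the comparison fails.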

2. String Transformation Methods

Python’s string objects come with many built-in methods, which are like functions that belong to an object. These methods are your primary tools for cleaning and standardizing text data, a very common task when dealing with user input or data from external sources.

Syntax of Common Methods

Method	Syntax	Explanation
Change Case	`my_string.lower()` / `my_string.upper()`	Returns a new string with all characters converted to lowercase or uppercase.
Remove Whitespace	`my_string.strip()`	Returns a new string with leading/trailing whitespace (spaces, tabs, newlines) removed.
Find & Replace	`my_string.replace(old, new)`	Returns a new string where all occurrences of the substring `old` are replaced with `new`.
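
A quick way to get a feel for these methods is to apply each one to the same input. In this sketch the variable `raw` and its contents are illustrative assumptions.

```python
raw = "  Hello, World!\n"

lowered = raw.lower()             # '  hello, world!\n'
uppered = raw.upper()             # '  HELLO, WORLD!\n'
stripped = raw.strip()            # 'Hello, World!'
replaced = raw.replace("l", "L")  # '  HeLLo, WorLd!\n' (every "l" is replaced)

# repr() makes the surrounding whitespace visible when printing.
print(repr(stripped), repr(replaced))
print(repr(raw))  # the original string is unchanged
```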

Crucial Point: The Immutability Rule

Because strings are immutable, these methods never change the original string. They always return a new, modified string. A very common beginner mistake is to call a method and forget to assign its return value to a variable.

# The WRONG way - the result of strip() is thrown away!
my_variable = "  Hello World  "
my_variable.strip() # A new, stripped string is created but immediately discarded
print(my_variable)  # Prints "  Hello World  " (unchanged)

# The CORRECT way - assign the returned value back to a variable
my_variable = "  Hello World  "
cleaned_variable = my_variable.strip()
print(cleaned_variable)  # Prints "Hello World"

Code Example: Cleaning User-Submitted Tags

Problem Statement: Write a function clean_tag(tag) that takes a messy blog tag and standardizes it by: (a) removing leading and trailing whitespace, (b) converting to lowercase, and (c) replacing the remaining spaces and underscores with hyphens.
Example Cases:
- clean_tag(" Python programming ") -> "python-programming"
- clean_tag("data_science") -> "data-science"

Complete Code:

def clean_tag(tag):
    # Step 1: Remove leading/trailing whitespace
    processed_tag = tag.strip()

    # Step 2: Convert to lowercase
    processed_tag = processed_tag.lower()

    # Step 3: Replace spaces with hyphens
    processed_tag = processed_tag.replace(' ', '-')

    # Step 4: Replace underscores with hyphens
    processed_tag = processed_tag.replace('_', '-')

    return processed_tag

# --- Test Cases ---
print(f"Original: '  Python programming  ' -> Cleaned: '{clean_tag('  Python programming  ')}'")
print(f"Original: 'data_science' -> Cleaned: '{clean_tag('data_science')}'")

Code Breakdown:
1. Step-by-Step Transformation: Notice how we re-assign the result of each method call back to the processed_tag variable. Each line takes the result of the previous step and applies a new transformation to it.
2. Real-World Relevance: This exact process is used to create clean, URL-friendly “slugs” for web pages, to standardize database entries, or to prepare text for analysis.

Expected Output:

Original: '  Python programming  ' -> Cleaned: 'python-programming'
Original: 'data_science' -> Cleaned: 'data-science'

3. Splitting and Joining Strings

The split() and join() methods are opposites and provide a bridge between the string and list data structures.

my_string.split(separator): Breaks a string into a list of smaller strings.
separator.join(my_list): Glues a list of strings together into a single string.

This pair is essential for parsing: taking raw text and breaking it into a structured format you can work with. For example, you can use .split(',') to process a simple line of CSV (Comma-Separated Values) data, as long as no field itself contains a comma.

Syntax

Method	Syntax	Explanation
Split	`my_string.split(separator)`	Returns a list of substrings. If `separator` is not provided, it splits on any amount of whitespace (spaces, tabs, newlines). The separator itself is not included in the list.
Join	`separator_string.join(list_of_strings)`	Returns a single string. It concatenates the elements of `list_of_strings`, with `separator_string` inserted between each element.
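
The difference between splitting with and without a separator is easiest to see side by side. This is a small sketch with made-up sample strings.

```python
line = "red,green,blue"
colors = line.split(",")        # ['red', 'green', 'blue']

sentence = "  the   quick\tbrown\nfox  "
words = sentence.split()        # ['the', 'quick', 'brown', 'fox']

# An explicit separator splits at every occurrence, so two
# adjacent spaces produce an empty string in the result.
spaced = "a  b".split(" ")      # ['a', '', 'b']

rejoined = " ".join(words)      # 'the quick brown fox'

print(colors, words, spaced, rejoined)
```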

Key Points & Common Pitfalls

The syntax for .join() often feels “backwards”. You call the method on the separator string (the “glue”), not on the list.
- Incorrect: ['a', 'b', 'c'].join('-') will cause an AttributeError.
- Correct: '-'.join(['a', 'b', 'c']) produces "a-b-c".
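
The pitfall above can be reproduced safely by catching the error. The sketch also shows that splitting on the same separator recovers the original list; the variable names are illustrative.

```python
letters = ["a", "b", "c"]

# Lists have no join() method, so this raises AttributeError.
try:
    letters.join("-")
    list_has_join = True
except AttributeError:
    list_has_join = False

# Call join() on the separator string instead.
joined = "-".join(letters)      # 'a-b-c'

# Splitting on the same separator reverses the operation.
round_trip = joined.split("-")  # ['a', 'b', 'c']

print(list_has_join, joined, round_trip)
```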

Code Example: Parsing a CSV Record

Problem Statement: Write a function format_record(csv_string) that takes a string like "LastName,FirstName,Age,City" and reformats it into a human-readable summary.
Example Case:
- format_record("Doe,John,32,New York") -> "RECORD: JOHN DOE (Age: 32) is from NEW YORK."

Complete Code:

def format_record(csv_string):
    # Step 1: Split the CSV string into a list of its parts.
    parts = csv_string.split(',')

    # Step 2: Extract the data from the list using indexing.
    last_name = parts[0]
    first_name = parts[1]
    age = parts[2]
    city = parts[3]

    # Step 3: Build the final output string using an f-string and methods.
    formatted_string = f"RECORD: {first_name.upper()} {last_name.upper()} (Age: {age}) is from {city.upper()}."

    return formatted_string

# --- Test Cases ---
print(format_record("Doe,John,32,New York"))
print(format_record("lee,sun-hi,28,seoul"))

Code Breakdown:
1. Parse, Then Process: This example demonstrates a fundamental programming pattern. First, we use .split(',') to parse the raw string into a structured list (['Doe', 'John', '32', 'New York']). This makes the individual pieces of data easy to access.
2. Process the Data: Once the data is in a list, we can easily access each part by its index (parts[0], parts[1], etc.) and use other string methods like .upper() to format them as needed for our final output.

Expected Output:

RECORD: JOHN DOE (Age: 32) is from NEW YORK.
RECORD: SUN-HI LEE (Age: 28) is from SEOUL.
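
One possible variation, shown here only as a sketch, checks the number of fields before using them and unpacks the list in a single assignment. The name `format_record_checked` and the choice to raise a ValueError are assumptions for this example, not part of the original function.

```python
def format_record_checked(csv_string):
    parts = csv_string.split(',')

    # Fail early with a clear message if the record is malformed.
    if len(parts) != 4:
        raise ValueError(f"expected 4 fields, got {len(parts)}")

    # Unpack the four fields in one step instead of indexing each one.
    last_name, first_name, age, city = parts

    return (
        f"RECORD: {first_name.upper()} {last_name.upper()} "
        f"(Age: {age}) is from {city.upper()}."
    )

print(format_record_checked("Doe,John,32,New York"))
```

With the original version, a record with too few fields fails with an IndexError at whichever index is missing; the explicit check reports the problem in terms of the record itself.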

4. Method Chaining

Method chaining is the practice of calling multiple methods sequentially in a single line. This is possible because most string methods return a new string, so you can immediately call another method on that newly returned string.

result = my_string.method1().method2().method3()

This code is executed from left to right. The string returned by method1 becomes the object that calls method2, and so on. Chaining can make your code for simple, linear transformations more concise and readable.
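
The left-to-right flow can be confirmed by comparing a chain with the same steps written out one at a time. This is a minimal sketch; the sample string and variable names are illustrative.

```python
raw = "  Data_Science Basics  "

# One step per line, with an intermediate variable each time.
step1 = raw.strip()               # 'Data_Science Basics'
step2 = step1.lower()             # 'data_science basics'
step3 = step2.replace(" ", "-")   # 'data_science-basics'

# The same steps as a single chain.
chained = raw.strip().lower().replace(" ", "-")
print(chained == step3)

# A chain only works while each call returns a string.
# split() returns a list, and lists have no upper() method.
try:
    "a b".split().upper()
    chain_broke = False
except AttributeError:
    chain_broke = True
print(chain_broke)
```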

Code Example: Refactoring with Method Chaining

Problem Statement: Let’s revisit our clean_tag function from before. Refactor it to perform all the transformations in a single, chained statement.

Complete Code:

def clean_tag_chained(tag):
    # Perform all the transformation steps in a single line.
    # The data flows from left to right through the chain of methods.
    return tag.strip().lower().replace(' ', '-').replace('_', '-')

# --- Test Cases (should produce identical results to the original) ---
print(f"Original: ' Machine Learning ' -> Cleaned: '{clean_tag_chained(' Machine Learning ')}'")

Code Breakdown:
1. The Data Flow: You can read the chained line as a series of steps:
  - Take the original tag…
  - …then .strip() it…
  - …then .lower() the result of that…
  - …then .replace() spaces in the result of that…
  - …then .replace() underscores in the result of that…
  - …and finally return the final result.
2. Readability: For a simple pipeline of transformations like this, chaining is very clean and “Pythonic”. However, for more complex logic, breaking the steps into separate lines with intermediate variables can make the code easier to read and debug. Choose the approach that results in the clearest code.

Expected Output:

Original: ' Machine Learning ' -> Cleaned: 'machine-learning'

Week 9 Practice Problems: String Processing

These problems are designed to test your understanding of this week’s string manipulation concepts. They will require you to combine string methods with fundamental concepts you’ve already learned, such as loops, conditional logic, and lists.

Try to solve each problem on your own before discussing solutions. Focus on breaking the problem down into smaller, logical steps.

Problem 1: SKU Validator

Scenario: You are writing software for a warehouse inventory system. A critical part of the system is validating product Stock Keeping Units (SKUs) to ensure they are formatted correctly before being entered into the database.

Task: Write a function validate_sku(sku_code) that checks if a given SKU string is valid according to a strict format.

Rules for a valid SKU:

The SKU must be in the format CATEGORY-PRODUCT_ID-SIZE. The three parts must be separated by hyphens (-).
The validation should ignore any leading or trailing whitespace and be case-insensitive (e.g., " elc-..." is treated the same as "ELC-...").
The SKU must contain exactly two hyphens after cleaning.
Category: The CATEGORY part must be exactly 3 letters long and must be one of the following codes: 'ELC' (Electronics), 'GRC' (Grocery), or 'HHL' (Household).
Product ID: The PRODUCT_ID part must be exactly 6 characters long, and all characters must be digits (0-9).
Size: The SIZE part must be one of the following single characters: 'S', 'M', or 'L'.

Function Specification:

Name: validate_sku
Parameter: sku_code (a string)
Returns: A tuple containing two values:
1. A boolean: True if the SKU is valid, False otherwise.
2. A string:
  - If valid, the function should return the cleaned, uppercase SKU string.
  - If invalid, it should return a descriptive error message explaining the first rule that failed.
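
The (boolean, string) return shape can be sketched with a much simpler check. The function `check_not_empty` below is hypothetical and is not part of the SKU solution; it only illustrates returning a tuple and unpacking it at the call site.

```python
def check_not_empty(value):
    cleaned = value.strip()
    if cleaned == "":
        # Invalid: return False together with an error message.
        return (False, "Error: value is empty.")
    # Valid: return True together with the cleaned value.
    return (True, cleaned)

# The caller can unpack the tuple into two variables.
is_valid, result = check_not_empty("  hello ")
print(is_valid, result)

is_valid_blank, message = check_not_empty("   ")
print(is_valid_blank, message)
```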

Example Cases to Test:

# Expected return value: (True, 'ELC-123456-S')
validate_sku("  elc-123456-S  ")

# Expected return value: (True, 'GRC-987654-L')
validate_sku("GRC-987654-L")

# Expected return value: (False, 'Error: Product ID must be 6 digits.')
validate_sku("HHL-12345-M") # Product ID is too short

# Expected return value: (False, 'Error: Invalid category code.')
validate_sku("FOD-112233-L") # 'FOD' is not a valid category

# Expected return value: (False, 'Error: SKU format must be CATEGORY-PRODUCT_ID-SIZE.')
validate_sku("GRC_112233_L") # Uses underscores instead of hyphens

# Expected return value: (False, 'Error: Invalid size code.')
validate_sku("GRC-112233-X") # 'X' is not a valid size

Problem 2: URL Slug Generator

Scenario: You are building a content management system (CMS) for a blog. When a writer creates a post with a title like “A Beginner’s Guide to Python!”, the system needs to automatically generate a URL-friendly version of that title, called a “slug”.

Task: Write a function generate_slug(title, max_length) that converts a blog post title into a URL-friendly slug.

Slug Generation Rules:

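The slug must be entirely lowercase, and leading/trailing whitespace in the title must be ignored.
Remove any character that is not a lowercase letter (a-z), a digit (0-9), or a space. You can start by just removing common punctuation like . , ! ? ' : and then extend this to other characters, such as the hyphens in the second example below.
Each run of one or more spaces in the cleaned title should be replaced with a single hyphen (-), so the slug never contains two hyphens in a row.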
The final slug must not be longer than max_length. If it is, it must be truncated.
Truncation Rule: The slug must not be cut off in the middle of a word. It should be truncated at the last hyphen that occurs at or before the max_length.

Function Specification:

Name: generate_slug
Parameters:
1. title (a string): The original blog post title.
2. max_length (an integer): The maximum allowed length of the slug.
Returns: The generated slug string.

Example Cases to Test:

# Expected return value: "a-beginners-guide-to-python"
generate_slug("  A Beginner's Guide to Python! ", 30)

# Expected return value: "10-common-mistakes-in-python-and-how-to-fix-them"
generate_slug("10 Common Mistakes in Python -- And How to Fix Them", 50)

# Expected return value: "data-structures"
# The full slug "data-structures-lists-tuples-and-dictionaries" is too long.
# The last hyphen before character 20 is after "structures".
generate_slug("Data Structures: Lists, Tuples, and Dictionaries", 20)

# Expected return value: "a-very-long-title-that"
generate_slug("A Very Long Title That Will Definitely Need To Be Truncated", 25)

Problem 3: Word Frequency Counter (List of Lists Edition)

Scenario: You are performing a basic text analysis on a document. A common first step is to count the frequency of each word. While a dictionary is the ideal tool for this (which we’ll learn about next week!), it’s an excellent exercise to solve this problem using only the tools you have now: lists and strings.

Task: Write a function count_word_frequencies(text) that takes a string of text, counts the occurrence of each word, and returns the result as a list of lists, sorted alphabetically by word.

Requirements:

The function should take a single string argument, text.
The counting must be case-insensitive (e.g., “The” and “the” are the same word).
Before counting, all common punctuation marks (specifically the period, comma, exclamation mark, and question mark) should be removed from the text.
The function must return a list of lists. Each inner list should contain two items: the word (string) and its count (integer). For example: [ [word1, count1], [word2, count2], ... ].
The final list of lists must be sorted alphabetically by word.

Function Specification:

Name: count_word_frequencies
Parameter: text (a string)
Returns: A list of lists, sorted alphabetically by word, representing the word counts.

Example Cases to Test:

# Example 1
text_block_1 = "The cat sat on the mat. The cat was happy."
# Expected return value:
# [['cat', 2], ['happy', 1], ['mat', 1], ['on', 1], ['sat', 1], ['the', 3], ['was', 1]]
count_word_frequencies(text_block_1)

# Example 2
text_block_2 = "The quick brown fox jumps over the lazy dog. The dog was not lazy, it was just resting."
# Expected return value:
# [['brown', 1], ['dog', 2], ['fox', 1], ['it', 1], ['jumps', 1], ['just', 1], ['lazy', 2], ['not', 1], ['over', 1], ['quick', 1], ['resting', 1], ['the', 3], ['was', 2]]
count_word_frequencies(text_block_2)

# Example 3
text_block_3 = "Go, Dog. Go!"
# Expected return value:
# [['dog', 1], ['go', 2]]
count_word_frequencies(text_block_3)