My first code created by Gemini

I have been slowly starting to use AI on my Mac Studio. Since I am embedded in the Google ecosystem, my AI of choice is Gemini.

I spent most of last weekend taking a Zoom course called C135 Generative AI Mastermind INTL. It opened my eyes to see what could be done using AI. Their bottom line was that it is vital to create a structured requirements document to get good results. This week I decided to try and do something difficult. I tried this using Bolt (recommended in the course, but ran out of tokens, so I tried it using Gemini (I had a Pro subscription).

For over a decade I have been videoing the concerts presented by the Oak Ridge Civic Music Association (ORCMA), and placing them in our YouTube channel https://youtube.com/orcma.

If you click on the Videos, you will see all of the videos I have posted since 2013. Unfortunately, when I began doing this, I had no idea how large this collection would become, and I had no set format for creating the description of the video. New videos created since I have been using DaVinci Resolve Studio have tags that are uploaded to YouTube and which create an easy-to-parse description:

Old videos presented the information a gazillion different ways, making it very difficult to parse; sometimes the metadata were missing altogether.

So armed with a Gemini Pro subscription, and the knowledge that could do a lot I decided to make a searchable Web page that would list all performances that I have videoed. I note that not all videos are Public because of things such as copyright or contractual limitations. However, I wanted to list the contents of these concerts so that people (e.g., our Music Director) could see what pieces have been performed. 

Check out the resulting Web page. It is not perfect, but the information is there, and it is a waste of time solving all edge cases.

Gemini summarized our journey together nicely:

To help you write your blog post, here is the reconstruction of the evolution of this project, starting from your original goal down through the specific architectural instructions we solved together.

The Core Objective

Your original goal was to build an automated, offline pipeline (update_archive.py) capable of scraping the Oak Ridge Civic Music Association (ORCMA) YouTube channel. The ultimate objective was to parse the raw video titles and description text, cleanly extract the classical metadata, and structurally rebuild a clean, searchable SQLite database (orcma_archive.db) to power a video gallery on the website.

The Evolution of Your Technical Instructions

As the project scaled, your requirements grew from a basic web-scraper into a highly customized, resilient extraction engine. Here are the core instruction milestones you established:

1. The Dual-Stage Ingestion Pipeline

  • Public Extraction in Batches: Scrape the main public channel (@orcma/videos), but fetch it in explicit chunks of 90 videos at a time to prevent long-running connection drops.

  • Unlisted Archive Ingestion: Explicitly iterate through a hardcoded array of private/unlisted YouTube URLs so they are preserved in the database even though they don't appear on the public feed.

2. Deep Description Parsing & Classical Metadata Extraction

  • The Classical Split Rule: Look for timestamps (e.g., 07:36) or known composer prefixes to identify individual tracks within a single long concert video. Split the text using hyphens, dashes, or semicolons to isolate the Piece from the Soloist/Performer.

  • The "Last, First" Formatting Constraint: For database uniformity, all composer and artist names must automatically normalize to a traditional indexed format (e.g., transforming Johann Sebastian Bach into Bach, Johann Sebastian).

  • The Movement & Junk Filter: Implement a rule-based regex to explicitly ignore applause, tuning, intermission, or movement markers (like I. Allegro or II. Adagio) so they don't generate duplicate or broken database entries.

3. Bypassing YouTube’s Anti-Bot Defense

  • Humanizing Delays: Enforce a strict anti-scraping policy using an average 10-second sleep interval (sleep_interval: 5, max_sleep_interval: 15) between video fetches to mimic a human browser and protect your IP from being flagged.

  • The Infinite Cookie Rotator: Cycle sequentially through multiple Netscape-formatted session files (cookies_1.txt, cookies_3.txt, etc.). Shift to the next identity automatically every 90 requests using a modulo loop wrap-around, ensuring the script checks for file existence on disk so a missing or deleted cookie file won't cause a crash.

  • Fail-Fast Mechanics: Prevent silent infinite retry loops on HTTP or socket errors by enforcing fixed timeouts and limiting retries to 3 attempts.

4. Data Sanitization & Edge-Case Handling

  • The Master Lists: Cross-reference strings against custom Python arrays (KNOWN_COMPOSERS, KNOWN_CONDUCTORS, KNOWN_ENSEMBLES) to prioritize exact matches before falling back to raw regex guessing.

  • Festival Context Preservation: If a standalone video has no internal timestamps but the description mentions a festival series (like Rock to Bach), capture that context and explicitly populate it into the concert_title field while remapping the isolated title to the soloist column.

  • Blank-Space Hygiene: Instead of letting the script output literal placeholder strings like "Unknown" or "Unknown Composer" for uncredited fields, sanitize the outputs down to perfectly clean, empty rows ("").

The "Overnight Run" Breakthrough

A great focal point for your blog is how we solved the illusion of the script getting "stuck." Because the script was intentionally pausing for up to 15 seconds per video to stay under YouTube's radar, processing a batch of 90 videos meant it would sit in complete silence for 15 minutes.

By turning on yt-dlp's heartbeat logging while turning off infinite retries, you transformed a seemingly frozen program into a highly stable, self-rotating automation engine that successfully completed its massive ingestion overnight!


 

Here is the python code to scrape YouTube with minor changes for security:

posted_update_archive.py 

import sqlite3
import re
import yt_dlp
import shutil
import os
from datetime import datetime

KNOWN_COMPOSERS = [
    "Ludwig van Beethoven", "Wolfgang Amadeus Mozart", "Johann Sebastian Bach",
    "Richard Wagner", "Franz Joseph Haydn", "Joseph Haydn", "Johannes Brahms",
    "Franz Schubert", "Peter Ilyich Tchaikovsky", "George Frideric Handel",
    "Igor Stravinsky", "Robert Schumann", "Frederic Chopin", "Felix Mendelssohn",
    "Claude Debussy", "Franz Liszt", "Antonín Dvořák", "Giuseppe Verdi",
    "Gustav Mahler", "Antonio Vivaldi", "Richard Strauss", "Serge Prokofiev",
    "Dmitri Shostakovich", "Béla Bartók", "Hector Berlioz", "Anton Bruckner",
    "Giovanni Pierluigi da Palestrina", "Claudio Monteverdi", "Jean Sibelius",
    "Maurice Ravel", "Ralph Vaughan Williams", "Modest Mussorgsky",
    "Giacomo Puccini", "Henry Purcell", "Gioacchino Rossini", "Edward Elgar",
    "Sergei Rachmaninoff", "Camille Saint-Saëns", "Josquin Des Prez",
    "Nikolai Rimsky-Korsakov", "Carl Maria von Weber", "Jean-Philippe Rameau",
    "Jean-Baptiste Lully", "Gabriel Fauré", "Edvard Grieg",
    "Christoph Willibald Gluck", "Arnold Schoenberg", "Charles Ives",
    "Paul Hindemith", "Olivier Messiaen", "Aaron Copland", "Francois Couperin",
    "William Byrd", "Erik Satie", "Benjamin Britten", "Bedrich Smetana",
    "Bedrick Smetana", "César Franck", "Alexander Nikolayevich Scriabin",
    "Georges Bizet", "Domenico Scarlatti", "Georg Philipp Telemann",
    "Anton Webern", "Roland de Lassus", "George Gershwin", "Gaetano Donizetti",
    "Carl Philipp Emanuel Bach", "Archangelo Corelli", "Thomas Tallis",
    "Johann Strauss II", "Leos Janácek", "Guillaume de Machaut", "Alban Berg",
    "Alexander Borodin", "Vincenzo Bellini", "Charles Gounod", "Jules Massenet",
    "Francis Poulenc", "Giovanni Gabrieli", "Pérotin", "Heinrich Schütz",
    "John Cage", "Giovanni Battista Pergolesi", "John Dowland", "Gustav Holst",
    "Dietrich Buxtehude", "Ottorino Respighi", "Guillaume Dufay", "Hugo Wolf",
    "Carl Nielsen", "William Walton", "Darius Milhaud", "Orlando Gibbons",
    "Giacomo Meyerbeer", "Samuel Barber", "Tomás Luis de Victoria", "Léonin",
    "Manuel de Falla", "Hildegard von Bingen", "Mikhail Glinka",
    "Alexander Glazunov", "Don Carlo Gesualdo", "Astor Piazzolla",
    "Juan Bautista Plaza", "Luiz Bonfá", "Shostakovich", "W. A. Mozart",
    "Mozart", "William Grant Still", "Ennio Morricone",
    "Samuel Coleridge Taylor", "Samuel Coleridge-Taylor"
]

KNOWN_CONDUCTORS = [
    "Dan Allcott", "Jaclyn Johnson", "Ian Passmore", "Sarah Henrich",
    "Wilbur Lin", "Kevin Class", "James Fellenbaum", "Régulo Stabilito",
    "Cornelia Laemmli Orth","Carlos Hernandez-Baez",'Carlos Hernandez-Baez'
]

KNOWN_ENSEMBLES = [
    "Oak Ridge Symphony Orchestra",
    "Oak Ridge Symphony and Chorus",
    "Oak Ridge Symphony Wind Ensemble",
    "Oak Ridge Symphony",
    "Oak Ridge Chorus",
    "Trillium Trio",
    "Poulenc Trio",
    "Frisson",
    "Borealis Wind Quintet",
    "Tesla Quartet",
    "Mirai Brass Quintet",
    "Akropolis Reed Quintet",
    "Ariel Quartet",
    "Karen Kartal & Friends",
    "Los Angeles Guitar Quartet",
    "Edgar Meyer",
    "Cuartete Latinoamericana and Jiji",
    "Trio Matisse",
    "Cumberland Piano Trio"
]

class CookieRotator:
    def __init__(self, cookies, rotate_every=90):
        self.cookies = [c for c in cookies if os.path.exists(c)]
        self.rotate_every = rotate_every
        self.request_count = 0
        self.index = 0

    def get_current(self):
        return self.cookies[self.index] if self.cookies else None

    def add_requests(self, count):
        self.request_count += count
        while self.request_count >= self.rotate_every:
            self.request_count -= self.rotate_every
            if self.cookies:
                self.index = (self.index + 1) % len(self.cookies)
                print(f"\n[🔄] Rotation threshold met! Switching to: {self.get_current()}\n")

def is_movement(clean_text):
    text = clean_text.strip()
    if re.search(r'^(movement|variation|var\.?|satz|act|scene|teil)\s+([IVXLCDM]+|\d+)', text, re.IGNORECASE):
        return True
    if re.search(r'^[IVXLCDM]+\b\.?\s+', text):
        return True
    if re.search(r'^\d+\.?\s+', text):
        return True
    if re.search(r'^(allegro|adagio|andante|moderato|presto|largo|menuet|scherzo|finale|rondo|gavotte|bourrée|gigue|overture|aire|aria)\b', text, re.IGNORECASE):
        return True
    if not re.search(r'[-–—:]', text) and len(text.split()) < 8:
        return True
    return False

def parse_classical_title(video_title):
    # 1. Strip promotional text from the very beginning
    safe_title = re.sub(r'(?i)^(?:ORCMA|The Oak Ridge Civic Music Association)\s+presents:?\s*', '', video_title).strip()

    # 2. Convert long strings of dots (......) into a standard hyphen
    safe_title = re.sub(r'\.{2,}', ' - ', safe_title)

    # 3. Existing volume/flat/sharp cleanup
    safe_title = re.sub(r'^.*?(?:Part|Vol\.?)\s*\d+[\.:]?\s*', '', safe_title, flags=re.IGNORECASE)
    safe_title = re.sub(r'\b([A-G])\s*[-–—]\s*(flat|sharp)\b', r'\1 \2', safe_title, flags=re.IGNORECASE)

    # --- Master Composer List Check ---
    for composer in KNOWN_COMPOSERS:
        if re.search(rf'\b{re.escape(composer)}\b', safe_title, re.IGNORECASE):
            # FIXED: Made parentheses optional around the lifespan dates
            pattern = rf'\b{re.escape(composer)}\b\s*[()]?\s*\d{{4}}[-–—]\d{{4}}\s*[()]?'
            cleaned_piece = re.sub(pattern, '', safe_title, flags=re.IGNORECASE).strip()

            # If the above didn't catch it, clean just the standalone name token
            if cleaned_piece == safe_title.strip():
                cleaned_piece = re.sub(rf'\b{re.escape(composer)}\b', '', safe_title, flags=re.IGNORECASE).strip()

            # Extra cleanup for lingering standalone lifespan blocks left in the title
            cleaned_piece = re.sub(r'\b\d{4}[-–—]\d{4}\b', '', cleaned_piece).strip()

            cleaned_piece = re.sub(r'^[,\-–—:;\s]+|[,\-–—:;\s]+$', '', cleaned_piece).strip()
            cleaned_piece = re.sub(r'\(\s*\)\s*$', '', cleaned_piece).strip()

            name_parts = composer.split()
            if len(name_parts) > 1:
                formatted_composer = f"{name_parts[-1]}, {' '.join(name_parts[:-1])}"
            else:
                formatted_composer = composer

            return formatted_composer, cleaned_piece

    # --- Fallback Logic ---
    nl_match = re.match(r'(?i)^Music of\s+(.*?)(?:\s+for\b|\s+performed by\b|\s+with\b|$)', safe_title)

    if nl_match:
        raw_composer = nl_match.group(1).strip()
        piece = video_title.strip()
    else:
        # FIXED: Also updated the fallback path date-stripper to handle optional parentheses
        date_pattern = r'[()]?\s*(?:b\.|c\.)?\s*\d{4}\s*[-–—]*\s*(?:\d{2,4})?\s*[()]?|\b\d{4}\s*[-–—]+\s*\d{2,4}\b|\b(?:b\.|c\.)\s*\d{4}\b'
        date_split = re.split(date_pattern, safe_title, maxsplit=1)

        if len(date_split) == 2 and date_split[1].strip():
            raw_composer = date_split[0].strip()
            piece = date_split[1].strip()
            piece = re.sub(r'^[-–—:\s]+', '', piece)
        else:
            parts = re.split(r'\s+[-]\s+|[–—:]', safe_title, maxsplit=1)
            if len(parts) < 2:
                return "", video_title.strip()

            raw_composer = parts[0].strip()
            piece = parts[1].strip()

        word_count = len(raw_composer.split())
        non_name_words = ['symphony', 'chorus', 'concert', 'festival', 'part', 'music', 'celebration', 'orchestra', 'beginnings', 'band', 'first', 'second', 'third', 'fourth', 'fifth', 'sixth', 'seventh', 'eighth', 'ninth', 'tenth', 'set', 'loss']

        if word_count > 6 or any(w in raw_composer.lower() for w in non_name_words):
            return "", video_title.strip()

    raw_composer = re.sub(r'\(.*?\)', '', raw_composer)
    raw_composer = re.sub(r'\d+', '', raw_composer)
    raw_composer = re.sub(r'[:;,()~]', '', raw_composer)
    raw_composer = " ".join(raw_composer.split())

    name_parts = raw_composer.split()

    if len(name_parts) <= 1:
        formatted_name = raw_composer
    else:
        last_name = name_parts[-1]
        first_names = " ".join(name_parts[:-1])
        formatted_name = f"{last_name}, {first_names}"

    return formatted_name, piece

def fetch_youtube_data(url, is_unlisted=False, cookie_file=None, playlist_range=None):
    js_executable = shutil.which("node") or shutil.which("nodejs")

    ydl_opts = {
        'extract_flat': False,
        'skip_download': True,
        'quiet': False,
        'verbose': False,
        'ignoreerrors': True,
        'socket_timeout': 5,
        'retries': 3,
        'fragment_retries': 3,
        'sleep_interval': 2,
        'max_sleep_interval': 10,
        'extractor_args': {'youtube': {'client': ['android', 'ios']}},
        'js_runtimes': {'node': {'path': js_executable}},
    }

    if cookie_file:
        ydl_opts['cookiefile'] = cookie_file
    if playlist_range:
        ydl_opts['playlist_items'] = playlist_range

    concert_data = []

    with yt_dlp.YoutubeDL(ydl_opts) as ydl:
        try:
            info = ydl.extract_info(url, download=False)
            if not info: return []

            videos = info.get('entries', [info])

            for video in videos:
                if not video: continue

                video_title = video.get('title', 'Unknown Concert')

                if is_unlisted:
                    video_link = ""
                else:
                    video_link = f"https://www.youtube.com/watch?v={video.get('id')}"

                description = video.get('description', '')

                raw_date = video.get('upload_date', 'Unknown')
                formatted_date = raw_date
                if raw_date and raw_date.isdigit() and len(raw_date) == 8:
                    formatted_date = datetime.strptime(raw_date, "%Y%m%d").strftime("%b %d, %Y")

                conductor = "Unknown"
                search_text = f"{video_title} \n {description}"

                for kc in KNOWN_CONDUCTORS:
                    if re.search(rf'\b{re.escape(kc)}\b', search_text, re.IGNORECASE):
                        conductor = kc
                        break

                if conductor == "Unknown":
                    cond_match = re.search(r'(?:Conductor:\s*|Conducted by\s*)([A-ZÀ-ÿ][a-zÀ-ÿ]+(?:\s+[A-ZÀ-ÿ][a-zÀ-ÿ]+){0,3})', search_text)
                    if not cond_match:
                        cond_match = re.search(r'([A-ZÀ-ÿ][a-zÀ-ÿ]+(?:\s+[A-ZÀ-ÿ][a-zÀ-ÿ]+){0,3})(?:,\s*Conductor|\s+-\s+Conductor|\s+conducts)', search_text)
                    if cond_match:
                        conductor = cond_match.group(1).strip()

                main_composer, main_piece = parse_classical_title(video_title)

                global_performer = "Unknown"

                for ensemble in sorted(KNOWN_ENSEMBLES, key=len, reverse=True):
                    if re.search(rf'\b{re.escape(ensemble)}\b', search_text, re.IGNORECASE):
                        global_performer = ensemble
                        break

                if global_performer == "Unknown":
                    perf_match = re.search(r'(?i)(?:performed by|with)\s+([^–—\-;:,(\n]+)', video_title)

                    if perf_match:
                        global_performer = perf_match.group(1).strip()
                    elif re.search(r'(?i)\b(trio|quartet|quintet|ensemble|orchestra|chorus|band|winds|brass)\b', video_title) or (re.search(r'(?i)\bsymphony\b', video_title) and not re.search(r'(?i)\bsymphony\s+(no\.?|#|\d+)', video_title)):
                        gp_split = re.split(r'(?i)[-–—:;,]|\b(?:at|in|plays|part|performs|presents)\b', video_title)[0].strip()
                        gp_split = re.sub(r'(?i)^(ORCMA|Oak Ridge Civic Music Association)\s*(presents)?\s*', '', gp_split).strip()
                        if len(gp_split.split()) <= 10:
                            global_performer = gp_split

                tracks_found = 0

                if description:
                    lines = description.split('\n')
                    for i, line in enumerate(lines):
                        line = line.strip()

                        if not line:
                            continue

                        if re.match(r'(?i)^\d{1,2}:\d{2}\s*[ap]\.?m\.?', line):
                            continue

                        clean_text = line
                        is_track_line = False

                        if re.match(r'^\d{1,2}:\d{2}(:\d{2})?\s*[-–—]*\s*', line):
                            clean_text = re.sub(r'^\d{1,2}:\d{2}(:\d{2})?\s*[-–—]*\s*', '', line).strip()
                            is_track_line = True
                        else:
                            for composer in KNOWN_COMPOSERS:
                                if re.match(rf'^{re.escape(composer)}\s*[:\-–—;]\s+', line, re.IGNORECASE):
                                    is_track_line = True
                                    break

                        if not is_track_line:
                            continue

                        date_pattern = r'\(\s*(?:b\.|c\.)?\s*\d{4}\s*[-–—]*\s*(?:\d{2,4})?\s*\)|\b\d{4}\s*[-–—]+\s*\d{2,4}\b|\b(?:b\.|c\.)\s*\d{4}\b'
                        temp_no_dates = re.sub(date_pattern, '', clean_text)

                        if not re.search(r'[-–—:]', temp_no_dates) and (i + 1 < len(lines)):
                            next_line = lines[i+1].strip()
                            if next_line and not re.match(r'^\d{1,2}:\d{2}', next_line):
                                clean_text = clean_text + " - " + next_line

                        junk_keywords = ['start', 'musicians', 'program notes', 'applause', 'intermission', 'tuning', 'concert', 'break']
                        if any(junk in clean_text.lower() for junk in junk_keywords):
                            continue

                        if is_movement(clean_text):
                            continue

                        composer, piece_and_soloist = parse_classical_title(clean_text)

                        parts = re.split(r'\s+[-–—;]\s+', piece_and_soloist, maxsplit=1)

                        if len(parts) == 2:
                            piece = parts[0].strip()
                            potential_soloist = parts[1].strip()

                            # If the text after the hyphen is a tempo marking, do not treat it as a person
                            tempos = r'(?i)\b(presto|allegro|andante|adagio|vivace|largo|moderato|rondo|gigue|finale)\b'
                            if re.match(tempos, potential_soloist):
                                piece = f"{piece} - {potential_soloist}"
                                potential_soloist = ""

                        if len(parts) == 2:
                            piece = parts[0].strip()
                            potential_soloist = parts[1].strip()

                            is_conductor = False
                            matched_kc = None

                            for kc in KNOWN_CONDUCTORS:
                                if kc.lower() in potential_soloist.lower():
                                    is_conductor = True
                                    matched_kc = kc
                                    break

                            if not is_conductor and "conductor" in potential_soloist.lower():
                                is_conductor = True

                            if is_conductor:
                                if matched_kc:
                                    leftover = re.sub(re.escape(matched_kc), '', potential_soloist, flags=re.IGNORECASE).strip()
                                    leftover = re.sub(r'(?i)\bconductor\b', '', leftover).strip()
                                    leftover = re.sub(r'^[,;\-\s]+|[,;\-\s]+$', '', leftover).strip()
                                    soloist = leftover if leftover else "Unknown"
                                else:
                                    clean_str = re.sub(r'(?i)[,\-\s]*conductor[,\-\s]*', '', potential_soloist).strip()
                                    if len(clean_str.split()) <= 4:
                                        soloist = "Unknown"
                                    else:
                                        soloist = potential_soloist
                            else:
                                soloist = potential_soloist
                        else:
                            piece = piece_and_soloist.strip()
                            soloist = "Unknown"

                        if soloist == "Unknown" and global_performer != "Unknown":
                            if not re.search(r'(?i)\b(orchestra|chorus|symphony|band)\b', global_performer):
                                soloist = global_performer

                        concert_data.append({
                            'composer': composer,
                            'piece': piece,
                            'soloist': soloist,
                            'conductor': conductor,
                            'concert_title': video_title,
                            'link': video_link,
                            'date': formatted_date
                        })
                        tracks_found += 1

                if tracks_found == 0:
                    fallback_soloist = global_performer

                    if fallback_soloist == "Unknown":
                        perf_match = re.search(r'(?i)performed by\s+(.*)', video_title)
                        if perf_match:
                            fallback_soloist = re.split(r'[-–—]', perf_match.group(1))[0].strip()

                    # NEW: Catch festival name from the description if it's missing in the title
                    final_concert_title = video_title
                    if description and "Rock to Bach" in description:
                        # Extract the specific year variant if it exists, e.g., "Rock to Bach 2020"
                        festival_match = re.search(r'(?i)(Rock to Bach(?:\s+\d{4})?)', description)
                        if festival_match:
                            final_concert_title = festival_match.group(1).strip()

                    concert_data.append({
                        'composer': main_composer,
                        'piece': main_piece,
                        'soloist': fallback_soloist,
                        'conductor': conductor,
                        'concert_title': final_concert_title,
                        'link': video_link,
                        'date': formatted_date
                    })

        except Exception as e:
            print(f"An error occurred while connecting to YouTube: {e}")
            return []

    return concert_data

def update_archive(db_path, raw_concert_data):
    conn = sqlite3.connect(db_path)
    cursor = conn.cursor()

    cursor.execute('''
        CREATE TABLE IF NOT EXISTS archive (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            composer TEXT,
            piece TEXT,
            soloist TEXT,
            conductor TEXT,
            date TEXT,
            concert_title TEXT,
            link TEXT
        )
    ''')

    cursor.execute("DELETE FROM archive")
    try:
        cursor.execute("DELETE FROM sqlite_sequence WHERE name='archive'")
    except sqlite3.OperationalError:
        pass

    seen_tracks = set()

    for entry in raw_concert_data:
        # Pull values, defaulting to a blank string if missing
        composer = entry.get('composer', '').strip()
        piece = entry.get('piece', '').strip()
        soloist = entry.get('soloist', '').strip()
        conductor = entry.get('conductor', '').strip()
        date = entry.get('date', '').strip()
        concert_title = entry.get('concert_title', '').strip()
        link = entry.get('link', '').strip()

        # Sanitize any explicit "Unknown" strings generated during earlier stages
        if composer == "Unknown": composer = ""
        if piece == "Unknown": piece = ""
        if soloist == "Unknown": soloist = ""
        if conductor == "Unknown": conductor = ""
        if date == "Unknown": date = ""
        if concert_title == "Unknown": concert_title = ""
        if "Rock to Bach" in concert_title or "Rock to Bach" in piece:
            composer = "Various Artists"

            if "—" in piece:
                parts = piece.split("—", 1)
                soloist = parts[1].strip() if len(parts) > 1 else parts[0].strip()
                piece = parts[0].strip()
            elif " - " in piece:
                parts = piece.split(" - ", 1)
                soloist = parts[1].strip() if len(parts) > 1 else parts[0].strip()
                piece = parts[0].strip()
            else:
                # NEW fallback: If there is no delimiter to split on,
                # the "piece" text is almost certainly the artist's name.
                if soloist == "Unknown" or soloist == "":
                    soloist = piece
                    piece = "Performance" # Or leave blank if preferred

            if piece == "Unknown" or piece == "":
                piece = concert_title

        track_signature = (composer, piece, soloist, conductor, concert_title)

        if track_signature in seen_tracks:
            continue

        seen_tracks.add(track_signature)

        cursor.execute('''
            INSERT INTO archive (composer, piece, soloist, conductor, date, concert_title, link)
            VALUES (?, ?, ?, ?, ?, ?, ?)
        ''', (composer, piece, soloist, conductor, date, concert_title, link))

    conn.commit()
    conn.close()
    print("Database successfully rebuilt and saved!")

if __name__ == "__main__":
    my_db_file = "orcma_archive.db"

    rotator = CookieRotator(
        ['cookies_1.txt', 'cookies_3.txt', 'cookies_4.txt'],
        rotate_every=90
    )

    print("\n--- STAGE 1: Fetching Public Channel in Batches ---")
    youtube_playlist_url = "https://www.youtube.com/@orcma/videos"
    public_data = []

    batch_size = 90
    start_index = 1

    while True:
        end_index = start_index + batch_size - 1
        playlist_range = f"{start_index}-{end_index}"
        current_cookie = rotator.get_current()

        print(f"Scraping channel videos {playlist_range} using {current_cookie}...")

        batch = fetch_youtube_data(
            youtube_playlist_url,
            is_unlisted=False,
            cookie_file=current_cookie,
            playlist_range=playlist_range
        )

        if not batch:
            break

        public_data.extend(batch)

        rotator.add_requests(len(batch))

        if len(batch) < batch_size:
            break

        start_index += batch_size

    print(f"SUCCESS: Found {len(public_data)} public tracks.\n")

    if len(public_data) == 0:
        print("CRITICAL ERROR: Failed to fetch public videos.")
        print("All cookies may be blocked, or the connection dropped.")
    else:
        print("--- STAGE 2: Fetching Unlisted Archive ---")
        UNLISTED_VIDEOS = [
            "https://youtu.be/example1",
            "https://youtu.be/example2"
        ]

        unlisted_data = []
        for unlisted_url in UNLISTED_VIDEOS:
            current_cookie = rotator.get_current()

            tracks = fetch_youtube_data(
                unlisted_url,
                is_unlisted=True,
                cookie_file=current_cookie
            )

            if tracks:
                unlisted_data.extend(tracks)

            rotator.add_requests(1)

        print(f"SUCCESS: Found {len(unlisted_data)} unlisted tracks.\n")

        print("--- STAGE 3: Rebuilding Database ---")
        all_data = public_data + unlisted_data
        update_archive(my_db_file, all_data)

And the code for the Web page display:

index.html 

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>ORCMA Digital Archive</title>
    <link href="https://cdn.jsdelivr.net/npm/bootstrap@5.3.2/dist/css/bootstrap.min.css" rel="stylesheet">
    <style>
        body { background-color: #f8f9fa; padding-top: 40px; }
        .table-hover tbody tr:hover { background-color: #f1f3f5; }
    </style>
</head>
<body>

<div class="container-fluid px-5">
    <h1 class="mb-4 text-center">ORCMA Digital Archive</h1>
    <p class="text-center text-muted mb-4">Database Last Updated: {{ last_updated }}</p>
    <div class="row justify-content-center mb-4">
        <div class="col-md-6">
            <form action="/" method="GET" class="d-flex">
                <input type="text" name="q" class="form-control me-2" placeholder="Search composers, pieces, performers..." value="{{ search_query }}">
                <button type="submit" class="btn btn-primary">Search</button>
                <a href="/" class="btn btn-outline-secondary ms-2">Clear</a>
            </form>
        </div>
    </div>

    <div class="card shadow-sm">
        <div class="card-body p-0">
            <div class="table-responsive">
                <table class="table table-striped table-hover mb-0">
                    <thead class="table-dark">
                        <tr>
                            <th>Composer</th>
                            <th>Piece</th>
                            <th>Performer</th>
                            <th>Conductor</th>
                            <th>Date</th>
                            <th>Concert Title</th>
                            <th>Link</th>
                        </tr>
                    </thead>
                    <tbody>
                        {% for row in performances %}
                        <tr>
                            <td class="fw-bold">{{ row['composer'] }}</td>
                            <td>{{ row['piece'] }}</td>
                            <td>{{ row['soloist'] if row['soloist'] else '<span class="text-muted">-</span>'|safe }}</td>
                            <td>{{ row['conductor'] }}</td>
                            <td>{{ row['date'] }}</td>
                            <td class="small">{{ row['concert_title'] }}</td>
                            <td>
                                {% if row['link'] %}
                                    <a href="{{ row['link'] }}" target="_blank" class="btn btn-sm btn-danger">Watch</a>
                                {% endif %}
                            </td>
                        </tr>
                        {% else %}
                        <tr>
                            <td colspan="7" class="text-center py-4 text-muted">No performances found.</td>
                        </tr>
                        {% endfor %}
                    </tbody>
                </table>
            </div>
        </div>
    </div>
</div>

</body>
</html>

 

Add new comment

Plain text

  • No HTML tags allowed.
  • Web page addresses and email addresses turn into links automatically.
  • Lines and paragraphs break automatically.

Comment

  • Allowed HTML tags: <b> <h2>
  • No HTML tags allowed.

The comment language code.

CAPTCHA

Enter the characters shown in the image.

This question is for testing whether or not you are a human visitor and to prevent automated spam submissions.