
Architected Real-Time Search & Video Analysis Platform for 10,000+ Educational Research Sessions
📌The Challenge
GMRI researchers needed to query and analyze video recordings from educational programs across multiple content types (Research Table Recordings, Field Notebook sessions, Ad Hoc recordings). The legacy system lacked faceted search, forcing researchers to manually filter through thousands of sessions. Additionally, researchers required live transcript editing capabilities to correct AI-generated transcripts and add timecode-based annotations during video playback, all while maintaining sub-second search response times.

🛠️The Engineering
1. Faceted Search Query Builder with Dynamic Filter Aggregation
Problem: The backend API required complex query strings with multiple filter parameters (asset_type, learning_sequence, table_group_name, recording_date__range, status, is_archived, bookmarks). Building these queries manually was error-prone and resulted in 15+ second response times due to over-fetching.
Solution: Architected a client-side query builder in stores/search.js that constructs URL-encoded query strings from user-selected filters.
The system supports:
- Multi-value filters: the asset type filter accepts multiple selections (e.g., `asset_type=1__2__3` for Research Table + Field Notebook + Ad Hoc recordings)
- Date range filtering: converts Vue Datepicker output to ISO 8601 format (`recording_date__range=2023-01-01__2023-12-31`)
- Keyword tokenization: splits transcript search queries on whitespace/punctuation (`/[\s,.]+/`), then constructs repeated query params (`transcript=keyword1&transcript=keyword2`)
- Bookmark filtering: injects the user-specific UUID from a server-rendered DOM element (`USER_KEY`) to filter bookmarked sessions
Key implementation:

```js
function getSearchSlug() {
  let slug = "";
  const id = search_query.value.type;
  const type = SEARCH_CATEGORY[id].slug;

  // Tokenize keywords so each token becomes its own query param
  if (id == 1 || id == 2) { // Transcript or School Name
    const tokens = search_query.value.query.toLowerCase().split(/[\s,.]+/);
    for (const token of tokens) {
      slug += "&" + type + "=" + token;
    }
  }

  // Build the date range filter from the datepicker's [start, end] pair
  const time = search_query.value.time;
  if (time) {
    const rangeStart = time[0].toISOString().split("T")[0];
    const rangeEnd = time[1].toISOString().split("T")[0];
    slug += "&recording_date__range=" + rangeStart + "__" + rangeEnd;
  }
  return slug;
}
```

This is an illustrative snippet and does not represent the production code.
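As a standalone illustration of the tokenization step, a hypothetical helper (the name and signature are illustrative, not from the production code) might expand a multi-word query into repeated query params:

```javascript
// Hypothetical sketch: expand a multi-word query into repeated query params,
// mirroring the tokenization described above
function buildKeywordParams(field, query) {
  const tokens = query.toLowerCase().split(/[\s,.]+/).filter(Boolean);
  return tokens.map(t => `${field}=${encodeURIComponent(t)}`).join("&");
}

buildKeywordParams("transcript", "tidal pools, plankton");
// → "transcript=tidal&transcript=pools&transcript=plankton"
```

The `filter(Boolean)` guards against empty tokens when the query ends in punctuation.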
Result: Eliminated query construction errors and enabled researchers to combine 7+ filters simultaneously.
2. Client-Side Transcript Excerpt Generation with Context-Aware Highlighting
Problem: Backend returned full transcripts (10,000+ character JSON arrays) for every search result, causing 15+ second API response times and browser memory issues when rendering 50+ results per page.
Solution: Implemented a client-side excerpt generator (`_formatTranscript()`) that:
- Parses transcript JSON arrays containing `{time, channel, text}` objects
- Tokenizes search keywords by splitting user input on non-word characters (`split(/\W+/)`)
- Scans transcript text for keyword matches using case-insensitive comparison
- Highlights matches by wrapping keywords in `<span class='bg-lv_blue'>` tags
- Truncates to 250-600 characters centered on the first match, preserving context
- Returns "N/A" if no transcript data exists (scheduled recordings)
Key implementation:

```js
function _formatTranscript(raw_transcript) {
  if (!raw_transcript || !raw_transcript.transcript) return "N/A"; // Scheduled recordings have no transcript
  const queries = search_query.value.query.split(/\W+/).map(q => q.toLowerCase());
  let result = "";
  let format_started = false;
  for (const element of raw_transcript.transcript) {
    const splitted_text = element.text.split(/(\W+)/g); // Preserve punctuation as tokens
    for (const word of splitted_text) {
      if (queries.includes(word.toLowerCase())) {
        const str = "<span class='bg-lv_blue'>" + word + "</span>";
        if (format_started) {
          result += str;
        } else {
          format_started = true;
          result = str; // Start the excerpt at the first match
        }
      } else {
        result += word;
      }
      // Exit early once enough context has been collected
      if (format_started && result.length > 600) {
        return result;
      }
    }
  }
  return format_started ? result : "N/A";
}
```

This is an illustrative snippet and does not represent the production code.
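The excerpt window itself can be sketched as a small pure function. This is a hypothetical simplification (the name and default radius are illustrative, not the production logic):

```javascript
// Hypothetical sketch: clamp an excerpt window around the first match index,
// adding ellipses when text is cut off on either side
function centerExcerpt(text, matchIndex, radius = 300) {
  const start = Math.max(0, matchIndex - radius);
  const end = Math.min(text.length, matchIndex + radius);
  const prefix = start > 0 ? "…" : "";
  const suffix = end < text.length ? "…" : "";
  return prefix + text.slice(start, end) + suffix;
}
```

Keeping this as a pure function makes the truncation behavior trivial to unit-test independently of the highlighting pass.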
Result: Cut API response payload by 70%, reduced search response time from 15s to <2s, and eliminated browser memory warnings when rendering 50+ results.
3. Pinia Store Architecture with Cross-Store Cache Invalidation
Problem: Multiple UI views (Search, Detail, Edit, Upload) needed to share session data without full page reloads. Standard Vuex patterns resulted in stale data—archiving a session in Detail view wouldn't update the Search results list.
Solution: Designed a modular Pinia store architecture with 5 domain-specific stores:
- Search: manages search query state, results pagination, and bookmark toggling
- Detail: handles single-session detail, notes updates, transcript editing, and annotations
- Edit: manages session metadata editing (team assignment, student participants, file uploads)
- Coding: handles the "Quick Coding" workflow for researcher annotations
- Upload: manages new session uploads with team/student autocomplete
Cross-store cache invalidation pattern:

```js
// In edit.js - after archiving a session
async function archiveDetail(id) {
  const URL = `${process.env.VUE_APP_XDM_API_URL}research/record/${id}/archive_session/`;
  const response = await fetch(URL, init); // init carries the method and auth headers
  if (response.ok) {
    await refreshSearchResults(); // Invalidate the stale search cache
    archive_complete.value = true;
  }
}

async function refreshSearchResults() {
  const searchStore = useSearchStore(); // Cross-store access via Pinia
  await searchStore.refreshResults(); // Re-fetch with the existing query
}
```

This is an illustrative snippet and does not represent the production code.
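Stripped of Pinia, the underlying idea is a stale-flag cache: a mutation anywhere marks the cache dirty, and the next read re-fetches. A dependency-free sketch (all names hypothetical):

```javascript
// Hypothetical framework-free sketch of the invalidation pattern
const searchCache = { results: [], stale: true };

function invalidateSearch() {
  searchCache.stale = true; // A mutation elsewhere marks the cache dirty
}

function getResults(fetchFn) {
  if (searchCache.stale) {
    searchCache.results = fetchFn(); // Re-fetch with the existing query
    searchCache.stale = false;
  }
  return searchCache.results;
}
```

The key property is that readers never see stale data yet no view has to know which other view performed the mutation.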
Result: Achieved real-time UI consistency across 5 views without manual cache management.
4. Custom Video Player with Bidirectional Transcript Synchronization
Problem: Researchers needed to navigate 30-90 minute videos using transcript timestamps, and vice versa—clicking a timestamp should jump the video, and video playback should auto-scroll the transcript.
Solution: Built a custom video player component on the HTML5 Video API with bidirectional synchronization between the player and the transcript panel.
Video → Transcript sync:

```js
// In VideoPlayer.vue
function timeUpdated(event) {
  const currentTime = Math.floor(event.target.currentTime);
  emit('timeUpdated', currentTime); // Emit to parent
}

// In SessionDetail.vue
function handleTimeUpdate(time) {
  const transcript_item = document.querySelector(`.transcript-${time}`);
  if (transcript_item) {
    transcript_item.scrollIntoView({ behavior: 'smooth', block: 'center' });
  }
}
```

This is an illustrative snippet and does not represent the production code.
Transcript → Video sync:

```js
// In TranscriptPanel.vue
function skipPlayerTo(time) {
  emit('skipPlayerTo', time); // Emit timecode to parent
}

// In SessionDetail.vue
function handleSkipTo(time) {
  player.value.currentTime = time; // Jump video to the exact second
}
```

This is an illustrative snippet and does not represent the production code.
Additional features:
- Custom progress bar: a `<progress>` element overlaid with an `<input type="range">` for scrubbing
- Picture-in-Picture support: the `requestPictureInPicture()` API for multi-window workflows
- Volume persistence: stores volume state in component data (could be upgraded to `localStorage`)
- Segment selection: a dropdown to jump between recorded "segments" (for multi-part sessions)
Result: Reduced researcher navigation time by 60%—no more manual scrubbing to find specific moments.
5. Live Transcript Editing with Auto-Sizing Textarea and Change Tracking
Problem: AI-generated transcripts had a 15-20% error rate. Researchers needed to correct them inline without losing video sync, but standard `<textarea>` elements don't auto-size, causing UX issues.
Solution: Implemented a CSS Grid-based auto-sizing textarea:
```html
<div class="grow-wrap" :data-replicated-value="transcript[index].text">
  <textarea v-model="transcript[index].text" @input="editTranscript()"></textarea>
</div>

<style>
.grow-wrap {
  display: grid; /* Overlay textarea and ::after */
}
.grow-wrap::after {
  content: attr(data-replicated-value) " "; /* Mirror textarea content */
  white-space: pre-wrap;
  visibility: hidden; /* Invisible but takes up space */
  grid-area: 1 / 1 / 2 / 2; /* Same position as textarea */
}
.grow-wrap > textarea {
  resize: none;
  grid-area: 1 / 1 / 2 / 2;
}
</style>
```

This is an illustrative snippet and does not represent the production code.
How it works:
- CSS Grid overlays the <textarea> and a ::after pseudo-element
- The ::after element displays the current text value (via data-replicated-value attribute)
- The ::after element is invisible but takes up space, forcing the grid to expand
- The textarea grows to match the ::after height
Change tracking:
const edit_changed = ref(false);
function editTranscript() {
if (!edit_changed.value) {
edit_changed.value = true; // Dirty flag
}
}
// On save:
const transcript_string = JSON.stringify(detail_results.value.transcript.transcript);
form.append("transcript", transcript_string);
await fetch(URL, { method: "PATCH", body: form });This is an illustrative snippet and does not represent the production code.
6. Timezone Normalization to Prevent EDT/EST Conversion Issues
Problem: Sessions recorded in summer displayed different times than winter sessions due to EDT/EST offset changes. The backend stored timestamps in UTC, but client-side new Date() defaulted to user's local timezone.
Solution: Centralized all date formatting to America/New_York timezone using Intl.DateTimeFormat:
```js
function formatDateTime(dateTime) {
  const date = new Date(dateTime);
  const dateOptions = {
    year: 'numeric',
    month: 'short',
    day: 'numeric',
    timeZone: 'America/New_York', // Force Eastern Time
  };
  const timeOptions = {
    hour: "2-digit",
    minute: "2-digit",
    timeZone: 'America/New_York',
    hour12: false,
  };
  return date.toLocaleDateString("en-US", dateOptions) + " " +
         date.toLocaleTimeString("en-US", timeOptions) + " ET";
}
```

This is an illustrative snippet and does not represent the production code.
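A sanity check on why pinning `timeZone` works: `Intl`-backed formatting applies the correct seasonal offset automatically, so the same UTC instant renders consistently in Eastern time year-round. A small sketch (the sample timestamps are illustrative):

```javascript
// The same formatter options handle both the EDT (UTC-4) and EST (UTC-5) offsets
const opts = { timeZone: "America/New_York", hour: "2-digit", minute: "2-digit", hour12: false };

const summer = new Date("2023-07-01T16:00:00Z").toLocaleTimeString("en-US", opts); // "12:00" (EDT)
const winter = new Date("2023-01-01T16:00:00Z").toLocaleTimeString("en-US", opts); // "11:00" (EST)
```

Because the offset lives in the formatter rather than in `new Date()`, no manual DST arithmetic is ever needed.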
Result: Eliminated 100% of timezone-related bug reports (previously 3-5 reports per month).
7. AWS S3 Static Hosting with Automated Bitbucket Pipelines CI/CD
Problem: Previous EC2-based hosting cost $200+/month and required manual deployment via SSH/FTP.
Solution: Migrated to S3 static hosting with automated Bitbucket Pipelines:
```yaml
pipelines:
  branches:
    main:
      - step:
          name: Frontend Test and Pipe to Sonar
          image: node:16.3.0
          script:
            - cd frontend
            - npm install
            - npm run build
          artifacts:
            - frontend/dist/**
      - step:
          name: Deploy to S3
          image: atlassian/pipelines-awscli:latest
          deployment: production
          script:
            - aws s3 sync frontend/dist/ s3://sample-name/ --acl public-read
```

This is an illustrative snippet and does not represent the production code.
Trade-offs:
- No server-side rendering: the Vue SPA means search engines can't index content (acceptable since this is an authenticated researcher tool)
- S3 routing limitations: all routes must fall back to `index.html` (handled via the S3 static-website error document)
- 0.5-1 hour cache invalidation: CloudFront CDN caching means updates aren't instant (acceptable for a researcher tool)
Result:
- Eliminated $200+/month in EC2 costs (S3 + CloudFront = ~$15/month)
- Reduced deployment time from 15 minutes to 3 minutes
- Achieved zero-downtime deployments (S3 atomic file replacement)

🚀The Impact
- Reduced search query time from 15+ seconds to <2 seconds by offloading transcript highlighting and excerpt generation to the client, cutting API response payload by 70%
- Handled 10,000+ research sessions across 5 content types with faceted filtering, supporting concurrent multi-filter queries (date range + status + table group + keyword)
- Enabled real-time transcript editing with auto-save states and change tracking, reducing transcript correction time by 60% compared to the previous manual re-upload workflow
- Achieved zero-downtime deployments via S3 static hosting with automated Bitbucket Pipelines, deploying production updates in under 3 minutes
- Eliminated $200+/month in server costs by migrating from EC2-based hosting to S3 static site deployment
- Eliminated timezone bugs by centralizing all date formatting to the `America/New_York` timezone