Phase 1: Experiment - Implementation Guide

Overview

Phase 1 is about rapid hypothesis testing and concept validation. The goal is to get from idea to working prototype in days, not weeks, with minimal overhead and maximum learning velocity.

When to Use Phase 1

Perfect for:
  • Testing new data analysis approaches
  • Validating business hypotheses with stakeholders
  • Exploring new datasets or data sources
  • Quick dashboards for one-time analysis
  • Proof-of-concept implementations

Not suitable for:
  • Applications requiring high uptime guarantees
  • Tools needed by large user groups (>10 people)
  • Complex multi-user workflows
  • Applications handling sensitive production data

Technology Stack

graph LR
    A[Data Scientist] --> B[Streamlit App]
    B --> C[Railway Deployment]
    C --> D[Supabase Sandbox Schema]
    D --> E[Raw Data Sources]

    style B fill:#FFE4B5
    style C fill:#E6F3FF
    style D fill:#F0FFF0

Core Components

  • Streamlit: Python-based UI framework optimized for data science
  • Railway: Zero-config deployment platform with Git integration
  • Supabase Sandbox Schema: Isolated database space for experimentation
  • Git Repository: Version control and deployment trigger

Step-by-Step Implementation

Step 1: Project Setup

# Clone the shared data-science repository
git clone https://github.com/aic-holdings/data-science.git
cd data-science

# Create a feature branch for your experiment
git checkout -b feature/my-experiment

# Create the experiment directory
mkdir -p experiments/my-experiment
cd experiments/my-experiment

Step 2: Streamlit Application Template

Create app.py:

import streamlit as st
import pandas as pd
import plotly.express as px
from supabase import create_client
import os

# Page configuration
st.set_page_config(
    page_title="My Experiment",
    page_icon="🧪",
    layout="wide"
)

# Supabase connection
@st.cache_resource
def init_supabase():
    url = os.getenv("SUPABASE_URL")
    key = os.getenv("SUPABASE_ANON_KEY") 
    return create_client(url, key)

supabase = init_supabase()

# Main application
def main():
    st.title("🧪 My Experiment")
    st.write("Describe what this experiment tests...")

    # Data loading section
    with st.expander("Data Loading", expanded=True):
        if st.button("Load Data"):
            # Query sandbox schema
            response = supabase.table('sandbox_data').select("*").execute()
            if response.data:
                df = pd.DataFrame(response.data)
                st.dataframe(df)

                # Basic visualization
                fig = px.histogram(df, x='column_name', title='Data Distribution')
                st.plotly_chart(fig, use_container_width=True)
            else:
                st.info("No data found in sandbox")

    # Analysis section  
    with st.expander("Analysis", expanded=False):
        st.write("Add your analysis here...")

    # Results section
    with st.expander("Results", expanded=False):
        st.write("Document your findings...")

if __name__ == "__main__":
    main()
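
Before deploying, run the app locally with streamlit run app.py to confirm the Supabase connection and layout work as expected.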

Step 3: Requirements File

Create requirements.txt:

streamlit>=1.28.0
plotly>=5.15.0  
pandas>=2.0.0
supabase>=1.0.0
python-dotenv>=1.0.0
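
python-dotenv is included so that credentials can live in a local .env file during development, while Railway injects the same variables in production. A minimal sketch, assuming the SUPABASE_URL and SUPABASE_ANON_KEY names used in app.py:

# Local development only: load a .env file before reading credentials.
# In production, Railway injects SUPABASE_URL and SUPABASE_ANON_KEY directly.
import os
from dotenv import load_dotenv

load_dotenv()  # reads key=value pairs from ./.env into the environment

assert os.getenv("SUPABASE_URL"), "SUPABASE_URL is not set"
assert os.getenv("SUPABASE_ANON_KEY"), "SUPABASE_ANON_KEY is not set"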

Step 4: Railway Configuration

Create railway.toml:

[build]
builder = "nixpacks"

[deploy]
startCommand = "streamlit run app.py --server.port=8501 --server.address=0.0.0.0 --server.headless=true --server.enableCORS=false --server.enableXsrfProtection=false"

# If Railway assigns your service a dynamic port, use --server.port=$PORT instead of hardcoding 8501.
# Environment variables (SUPABASE_URL, SUPABASE_ANON_KEY) are set through the Railway dashboard.

Step 5: Database Schema Setup

Connect to Supabase and create your sandbox tables:

-- Create sandbox schema if not exists
CREATE SCHEMA IF NOT EXISTS sandbox;

-- Example experimental table
CREATE TABLE sandbox.my_experiment_data (
    id SERIAL PRIMARY KEY,
    created_at TIMESTAMP DEFAULT NOW(),
    data_source TEXT,
    raw_data JSONB,
    processed_result NUMERIC,
    notes TEXT
);

-- Grant access to your user
GRANT ALL ON SCHEMA sandbox TO your_user;
GRANT ALL ON ALL TABLES IN SCHEMA sandbox TO your_user;
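
Note that the app template queries tables through the API's default (public) schema. To query sandbox.* tables directly, the sandbox schema typically also needs to be added to the exposed schemas in the Supabase API settings, and the client must be pointed at it. A hedged sketch, assuming a supabase-py version whose ClientOptions accepts a schema field (the import path varies by version):

import os
from supabase import create_client
from supabase.lib.client_options import ClientOptions  # location may differ by client version

# Point PostgREST at the sandbox schema instead of the default "public"
options = ClientOptions(schema="sandbox")
supabase_sandbox = create_client(
    os.getenv("SUPABASE_URL"),
    os.getenv("SUPABASE_ANON_KEY"),
    options=options,
)

# Table names now resolve inside the sandbox schema
response = supabase_sandbox.table("my_experiment_data").select("*").limit(5).execute()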

Step 6: Railway Deployment

  1. Connect Repository: Link your Git repo to Railway
  2. Set Environment Variables in the Railway dashboard:
    SUPABASE_URL=your_supabase_url
    SUPABASE_ANON_KEY=your_anon_key
  3. Deploy: Pushing to Git triggers an automatic deployment
    git add .
    git commit -m "Initial experiment setup"
    git push origin feature/my-experiment

Data Access Patterns

Reading Data

# Simple query
response = supabase.table('sandbox_data').select("*").execute()
data = response.data

# Filtered query  
response = supabase.table('sandbox_data').select("*").gte('value', 100).execute()

# Join with core data (read-only)
response = supabase.rpc('get_experiment_data', {'param': value}).execute()
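
In practice it helps to wrap these queries in a single cached helper that returns a DataFrame, so Streamlit reruns do not hit Supabase every time. A minimal sketch using the same client as app.py (the table name is illustrative):

@st.cache_data(ttl=600)  # re-query at most every 10 minutes
def load_sandbox_table(table_name: str) -> pd.DataFrame:
    # Pull all rows from a sandbox table; return an empty frame if nothing is there
    response = supabase.table(table_name).select("*").execute()
    return pd.DataFrame(response.data or [])

df = load_sandbox_table('sandbox_data')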

Writing Data

# Insert experimental results
result = supabase.table('sandbox_results').insert({
    'experiment_name': 'my-experiment',
    'result_data': {'accuracy': 0.85, 'precision': 0.82},
    'created_by': 'data_scientist_name'
}).execute()
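
Inserts also accept a list of rows, which keeps round-trips down when logging several results at once; the response echoes the rows that were written. Table and column names mirror the single-row example above:

# Batch insert: one request, many rows
rows = [
    {'experiment_name': 'my-experiment', 'result_data': {'run': 1, 'accuracy': 0.85}},
    {'experiment_name': 'my-experiment', 'result_data': {'run': 2, 'accuracy': 0.87}},
]
result = supabase.table('sandbox_results').insert(rows).execute()
st.write(f"Inserted {len(result.data)} rows")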

Best Practices for Phase 1

Code Organization

my-experiment/
├── app.py              # Main Streamlit application
├── requirements.txt    # Python dependencies
├── railway.toml        # Deployment configuration
├── utils.py            # Helper functions
├── data/               # Local data files (if any)
└── README.md           # Experiment documentation

Performance Tips

Caching: Use @st.cache_data for expensive computations

@st.cache_data
def load_and_process_data():
    # Expensive data loading/processing
    return processed_data

Session State: Maintain state across reruns

if 'data' not in st.session_state:
    st.session_state.data = load_data()
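
A button that clears the stored entry gives users a manual refresh without restarting the app; st.rerun (available in the pinned Streamlit version) then re-executes the script so load_data runs again:

if st.button("Refresh data"):
    # Drop the cached entry and rerun so load_data() executes on the next pass
    st.session_state.pop('data', None)
    st.rerun()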

Pagination: For large datasets

page_size = 100
page = st.number_input('Page', min_value=1, value=1)
offset = (page - 1) * page_size

response = supabase.table('data').select("*").range(offset, offset + page_size - 1).execute()
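
To show how many pages exist, the total row count can be requested alongside the page. This assumes your supabase-py version supports the count argument on select (check the client docs if it errors):

# Request an exact row count together with the page of data
response = (
    supabase.table('data')
    .select("*", count="exact")
    .range(offset, offset + page_size - 1)
    .execute()
)
total_rows = response.count or 0
st.caption(f"Page {page} of {max(1, -(-total_rows // page_size))}")  # ceiling division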

Documentation Standards

Always include in your README.md:

# Experiment: [Name]

## Hypothesis
What are you testing?

## Data Sources  
What data are you using?

## Methodology
How are you testing the hypothesis?

## Results
What did you find?

## Next Steps
Should this graduate to Phase 2?

Common Patterns

File Upload and Processing

uploaded_file = st.file_uploader("Upload CSV", type=['csv'])
if uploaded_file:
    df = pd.read_csv(uploaded_file)
    # Process and store in sandbox
    processed_data = process_data(df)
    store_results(processed_data)
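
process_data and store_results are not defined in this guide; they stand in for whatever transformation and persistence the experiment needs. One possible shape, assuming they sit alongside the rest of app.py (the column handling and the sandbox_results target table are illustrative):

def process_data(df: pd.DataFrame) -> pd.DataFrame:
    # Placeholder transformation: normalize column names and drop fully empty rows
    df = df.rename(columns=lambda c: c.strip().lower().replace(' ', '_'))
    return df.dropna(how='all')

def store_results(df: pd.DataFrame) -> None:
    # Persist processed rows to a sandbox table; to_dict('records') yields JSON-friendly dicts
    supabase.table('sandbox_results').insert(df.to_dict('records')).execute()
    st.success(f"Stored {len(df)} rows in sandbox_results")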

External API Integration

import requests  # add requests to requirements.txt if you use this pattern

@st.cache_data(ttl=3600)  # Cache for 1 hour
def fetch_external_data(symbol):
    # Call external API; fail fast on errors rather than caching bad responses
    response = requests.get(f"https://api.example.com/data/{symbol}", timeout=30)
    response.raise_for_status()
    return response.json()

Interactive Filtering

# Sidebar controls
st.sidebar.header("Filters")
date_range = st.sidebar.date_input("Date Range", value=[start_date, end_date])
category = st.sidebar.selectbox("Category", options=['A', 'B', 'C'])

# Apply filters (assumes data['date'] holds date values; convert with pd.to_datetime(...).dt.date if needed)
filtered_data = data[
    (data['date'] >= date_range[0]) & 
    (data['date'] <= date_range[1]) &
    (data['category'] == category)
]

Troubleshooting

Common Issues

Database Connection Errors
  • Verify environment variables are set correctly
  • Check Supabase URL format and API key permissions
  • Ensure sandbox schema exists and has proper grants

Railway Deployment Failures
  • Check requirements.txt for version conflicts
  • Verify railway.toml syntax
  • Review Railway build logs for specific errors

Streamlit Performance Issues
  • Add caching to expensive operations
  • Use pagination for large datasets
  • Consider data sampling for initial exploration

Getting Help

  1. Check Railway logs: railway logs command
  2. Review Streamlit documentation: Common UI patterns
  3. Test locally first: streamlit run app.py
  4. Ask in team Slack: Share Railway URL for quick debugging

Success Criteria

Your Phase 1 experiment is successful when:

✅ Working application: Deployed and accessible via Railway URL
✅ User feedback: At least 3 people have used it and provided input
✅ Clear results: Hypothesis is proven or disproven with data
✅ Documentation: Results and next steps are documented
✅ Decision made: Clear recommendation for graduation or archival

Graduation Checklist

Before moving to Phase 2, ensure:

  • Regular usage by target audience (>5 users tested)
  • Stable data requirements identified
  • Business value clearly demonstrated
  • Basic error handling implemented
  • Performance is acceptable for intended use
  • Security considerations reviewed
  • Data contracts defined

Example: MarginIQ Phase 1

Week 1: Built basic PDF upload and OCR extraction
  • Single-page Streamlit app
  • Upload PDF → extract text → display tables
  • Stored raw results in sandbox.margin_extractions
  • 3 users tested with sample Goldman Sachs reports

Results: Proved OCR could extract margin data accurately
Decision: Graduate to Phase 2 for production use

This Phase 1 took 5 days and validated the core technical approach before investing in production features.