Phase 1: Experiment - Implementation Guide¶
Overview¶
Phase 1 is about rapid hypothesis testing and concept validation. The goal is to get from idea to working prototype in days, not weeks, with minimal overhead and maximum learning velocity.
When to Use Phase 1¶
Perfect for:
- Testing new data analysis approaches
- Validating business hypotheses with stakeholders
- Exploring new datasets or data sources
- Quick dashboards for one-time analysis
- Proof-of-concept implementations
Not suitable for:
- Applications requiring high uptime guarantees
- Tools needed by large user groups (>10 people)
- Complex multi-user workflows
- Applications handling sensitive production data
Technology Stack¶
graph LR
A[Data Scientist] --> B[Streamlit App]
B --> C[Railway Deployment]
C --> D[Supabase Sandbox Schema]
D --> E[Raw Data Sources]
style B fill:#FFE4B5
style C fill:#E6F3FF
style D fill:#F0FFF0
Core Components¶
Streamlit: Python-based UI framework optimized for data science
Railway: Zero-config deployment platform with Git integration
Supabase Sandbox Schema: Isolated database space for experimentation
Git Repository: Version control and deployment trigger
Step-by-Step Implementation¶
Step 1: Project Setup¶
# Clone the shared repository
git clone https://github.com/aic-holdings/data-science.git
cd data-science
# Create the experiment directory
mkdir -p experiments/my-experiment
cd experiments/my-experiment
Step 2: Streamlit Application Template¶
Create app.py:
import streamlit as st
import pandas as pd
import plotly.express as px
from supabase import create_client
import os

# Page configuration
st.set_page_config(
    page_title="My Experiment",
    page_icon="🧪",
    layout="wide"
)

# Supabase connection
@st.cache_resource
def init_supabase():
    url = os.getenv("SUPABASE_URL")
    key = os.getenv("SUPABASE_ANON_KEY")
    return create_client(url, key)

supabase = init_supabase()

# Main application
def main():
    st.title("🧪 My Experiment")
    st.write("Describe what this experiment tests...")

    # Data loading section
    with st.expander("Data Loading", expanded=True):
        if st.button("Load Data"):
            # Query sandbox schema
            response = supabase.table('sandbox_data').select("*").execute()
            if response.data:
                df = pd.DataFrame(response.data)
                st.dataframe(df)

                # Basic visualization (replace 'column_name' with a real column)
                fig = px.histogram(df, x='column_name', title='Data Distribution')
                st.plotly_chart(fig, use_container_width=True)
            else:
                st.info("No data found in sandbox")

    # Analysis section
    with st.expander("Analysis", expanded=False):
        st.write("Add your analysis here...")

    # Results section
    with st.expander("Results", expanded=False):
        st.write("Document your findings...")

if __name__ == "__main__":
    main()
Step 3: Requirements File¶
Create requirements.txt:
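The template above imports four packages, so a minimal requirements.txt would look like the following (exact version pins are left to you):

```
streamlit
pandas
plotly
supabase
```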
Step 4: Railway Configuration¶
Create railway.toml:
[build]
builder = "nixpacks"
[deploy]
startCommand = "streamlit run app.py --server.port=8501 --server.address=0.0.0.0 --server.headless=true --server.enableCORS=false --server.enableXsrfProtection=false"
# Environment variables will be set through Railway dashboard
Step 5: Database Schema Setup¶
Connect to Supabase and create your sandbox tables:
-- Create sandbox schema if not exists
CREATE SCHEMA IF NOT EXISTS sandbox;
-- Example experimental table
CREATE TABLE sandbox.my_experiment_data (
    id SERIAL PRIMARY KEY,
    created_at TIMESTAMP DEFAULT NOW(),
    data_source TEXT,
    raw_data JSONB,
    processed_result NUMERIC,
    notes TEXT
);
-- Grant access to your user
GRANT ALL ON SCHEMA sandbox TO your_user;
GRANT ALL ON ALL TABLES IN SCHEMA sandbox TO your_user;
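Note that `supabase.table(...)` targets the `public` schema by default. In recent versions of supabase-py, one way to reach the `sandbox` schema is the client's schema selector; this is a sketch, and it assumes `sandbox` has been added to the exposed schemas in your Supabase API settings:

```python
# Query a table in the sandbox schema instead of public
# (assumes "sandbox" is listed under exposed schemas in Supabase API settings)
response = (
    supabase.schema("sandbox")
    .table("my_experiment_data")
    .select("*")
    .execute()
)
```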
Step 6: Railway Deployment¶
- Connect Repository: Link your Git repo to Railway
- Set Environment Variables: Add SUPABASE_URL and SUPABASE_ANON_KEY through the Railway dashboard (see the example below)
- Deploy: A push to Git triggers an automatic deployment
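The app template reads two variables; the values below are placeholders to replace with your own project's credentials:

```
SUPABASE_URL=https://<your-project>.supabase.co
SUPABASE_ANON_KEY=<your-anon-key>
```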
Data Access Patterns¶
Reading Data¶
# Simple query
response = supabase.table('sandbox_data').select("*").execute()
data = response.data
# Filtered query
response = supabase.table('sandbox_data').select("*").gte('value', 100).execute()
# Join with core data (read-only)
response = supabase.rpc('get_experiment_data', {'param': value}).execute()
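The `rpc` call assumes a corresponding Postgres function exists on the Supabase side. A hypothetical sketch of what `get_experiment_data` might look like (the signature and body here are illustrative only, not part of the template):

```sql
-- Hypothetical read-only helper joining experiment data for RPC access
CREATE OR REPLACE FUNCTION get_experiment_data(param INTEGER)
RETURNS TABLE (id INTEGER, processed_result NUMERIC, notes TEXT) AS $$
  SELECT d.id, d.processed_result, d.notes
  FROM sandbox.my_experiment_data d
  WHERE d.id >= param;
$$ LANGUAGE sql STABLE;
```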
Writing Data¶
# Insert experimental results
result = supabase.table('sandbox_results').insert({
'experiment_name': 'my-experiment',
'result_data': {'accuracy': 0.85, 'precision': 0.82},
'created_by': 'data_scientist_name'
}).execute()
Best Practices for Phase 1¶
Code Organization¶
my-experiment/
├── app.py           # Main Streamlit application
├── requirements.txt # Python dependencies
├── railway.toml     # Deployment configuration
├── utils.py         # Helper functions
├── data/            # Local data files (if any)
└── README.md        # Experiment documentation
Performance Tips¶
Caching: Use @st.cache_data for expensive computations
@st.cache_data
def load_and_process_data():
    # Expensive data loading/processing runs once, then serves from cache
    response = supabase.table('sandbox_data').select("*").execute()
    return pd.DataFrame(response.data)
Session State: Maintain state across reruns
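For example, a minimal sketch with a hypothetical counter that persists across reruns:

```python
# Initialize once; subsequent reruns keep the stored value
if "run_count" not in st.session_state:
    st.session_state.run_count = 0

if st.button("Increment"):
    st.session_state.run_count += 1

st.write(f"Button pressed {st.session_state.run_count} times")
```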
Pagination: For large datasets
page_size = 100
page = st.number_input('Page', min_value=1, value=1)
offset = (page - 1) * page_size
response = supabase.table('data').select("*").range(offset, offset + page_size - 1).execute()
Documentation Standards¶
Always include in your README.md:
# Experiment: [Name]
## Hypothesis
What are you testing?
## Data Sources
What data are you using?
## Methodology
How are you testing the hypothesis?
## Results
What did you find?
## Next Steps
Should this graduate to Phase 2?
Common Patterns¶
File Upload and Processing¶
uploaded_file = st.file_uploader("Upload CSV", type=['csv'])
if uploaded_file:
    df = pd.read_csv(uploaded_file)

    # Process and store in sandbox
    processed_data = process_data(df)
    store_results(processed_data)
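`process_data` and `store_results` are left to the experiment. A hypothetical sketch of both, assuming results go to the `sandbox_results` table used elsewhere in this guide:

```python
def process_data(df: pd.DataFrame) -> dict:
    # Illustrative processing: summarize the uploaded file
    return {
        "row_count": len(df),
        "columns": list(df.columns),
    }

def store_results(processed: dict):
    # Persist the summary to the sandbox results table
    supabase.table("sandbox_results").insert({
        "experiment_name": "my-experiment",
        "result_data": processed,
    }).execute()
```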
External API Integration¶
import requests

@st.cache_data(ttl=3600)  # Cache for 1 hour
def fetch_external_data(symbol):
    # Call external API
    response = requests.get(f"https://api.example.com/data/{symbol}")
    response.raise_for_status()
    return response.json()
Interactive Filtering¶
# Sidebar controls
st.sidebar.header("Filters")
date_range = st.sidebar.date_input("Date Range", value=[start_date, end_date])
category = st.sidebar.selectbox("Category", options=['A', 'B', 'C'])

# Apply filters
filtered_data = data[
    (data['date'] >= date_range[0]) &
    (data['date'] <= date_range[1]) &
    (data['category'] == category)
]
Troubleshooting¶
Common Issues¶
Database Connection Errors
- Verify environment variables are set correctly
- Check Supabase URL format and API key permissions
- Ensure sandbox schema exists and has proper grants
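A quick sanity check you can drop into the app to confirm the variables are visible (it reports presence only, never the secret values):

```python
import os
import streamlit as st

# Confirm the variables are set without exposing their values
st.write("SUPABASE_URL set:", bool(os.getenv("SUPABASE_URL")))
st.write("SUPABASE_ANON_KEY set:", bool(os.getenv("SUPABASE_ANON_KEY")))
```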
Railway Deployment Failures
- Check requirements.txt for version conflicts
- Verify railway.toml syntax
- Review Railway build logs for specific errors
Streamlit Performance Issues
- Add caching to expensive operations
- Use pagination for large datasets
- Consider data sampling for initial exploration
Getting Help¶
- Check Railway logs: `railway logs` command
- Review Streamlit documentation: common UI patterns
- Test locally first: `streamlit run app.py`
- Ask in team Slack: share the Railway URL for quick debugging
Success Criteria¶
Your Phase 1 experiment is successful when:
- ✅ Working application: Deployed and accessible via Railway URL
- ✅ User feedback: At least 3 people have used it and provided input
- ✅ Clear results: Hypothesis is proven or disproven with data
- ✅ Documentation: Results and next steps are documented
- ✅ Decision made: Clear recommendation for graduation or archival
Graduation Checklist¶
Before moving to Phase 2, ensure:
- Regular usage by target audience (>5 users tested)
- Stable data requirements identified
- Business value clearly demonstrated
- Basic error handling implemented
- Performance is acceptable for intended use
- Security considerations reviewed
- Data contracts defined
Example: MarginIQ Phase 1¶
Week 1: Built basic PDF upload and OCR extraction
- Single page Streamlit app
- Upload PDF → extract text → display tables
- Stored raw results in sandbox.margin_extractions
- 3 users tested with sample Goldman Sachs reports
Results: Proved OCR could extract margin data accurately
Decision: Graduate to Phase 2 for production use
This Phase 1 took 5 days and validated the core technical approach before investing in production features.