Phase 1: Experiment - Implementation Guide¶
Overview¶
Phase 1 is about rapid hypothesis testing and concept validation. The goal is to get from idea to working prototype in days, not weeks, with minimal overhead and maximum learning velocity.
When to Use Phase 1¶
Perfect for:
- Testing new data analysis approaches
- Validating business hypotheses with stakeholders
- Exploring new datasets or data sources
- Quick dashboards for one-time analysis
- Proof-of-concept implementations
Not suitable for:
- Applications requiring high uptime guarantees
- Tools needed by large user groups (>10 people)
- Complex multi-user workflows
- Applications handling sensitive production data
Technology Stack¶
graph LR
A[Data Scientist] --> B[Streamlit App]
B --> C[Railway Deployment]
C --> D[Supabase Sandbox Schema]
D --> E[Raw Data Sources]
style B fill:#FFE4B5
style C fill:#E6F3FF
style D fill:#F0FFF0
Core Components¶
Streamlit: Python-based UI framework optimized for data science
Railway: Zero-config deployment platform with Git integration
Supabase Sandbox Schema: Isolated database space for experimentation
Git Repository: Version control and deployment trigger
Step-by-Step Implementation¶
Step 1: Project Setup¶
# Clone the shared repository
git clone https://github.com/aic-holdings/data-science.git
cd data-science
# Create the experiment directory
mkdir -p experiments/my-experiment
cd experiments/my-experiment
Step 2: Streamlit Application Template¶
Create app.py:
import streamlit as st
import pandas as pd
import plotly.express as px
from supabase import create_client
import os

# Page configuration
st.set_page_config(
    page_title="My Experiment",
    page_icon="🧪",
    layout="wide"
)

# Supabase connection
@st.cache_resource
def init_supabase():
    url = os.getenv("SUPABASE_URL")
    key = os.getenv("SUPABASE_ANON_KEY")
    return create_client(url, key)

supabase = init_supabase()

# Main application
def main():
    st.title("🧪 My Experiment")
    st.write("Describe what this experiment tests...")

    # Data loading section
    with st.expander("Data Loading", expanded=True):
        if st.button("Load Data"):
            # Query sandbox schema
            response = supabase.table('sandbox_data').select("*").execute()
            if response.data:
                df = pd.DataFrame(response.data)
                st.dataframe(df)

                # Basic visualization (replace 'column_name' with a real column)
                fig = px.histogram(df, x='column_name', title='Data Distribution')
                st.plotly_chart(fig, use_container_width=True)
            else:
                st.info("No data found in sandbox")

    # Analysis section
    with st.expander("Analysis", expanded=False):
        st.write("Add your analysis here...")

    # Results section
    with st.expander("Results", expanded=False):
        st.write("Document your findings...")

if __name__ == "__main__":
    main()
Step 3: Requirements File¶
Create requirements.txt:
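The template above imports four packages, so a minimal requirements.txt would look like the following (exact version pins are left to you):

```
streamlit
pandas
plotly
supabase
```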
Step 4: Railway Configuration¶
Create railway.toml:
[build]
builder = "nixpacks"
[deploy]
startCommand = "streamlit run app.py --server.port=8501 --server.address=0.0.0.0 --server.headless=true --server.enableCORS=false --server.enableXsrfProtection=false"
# Environment variables will be set through Railway dashboard
Step 5: Database Schema Setup¶
Connect to Supabase and create your sandbox tables:
-- Create sandbox schema if not exists
CREATE SCHEMA IF NOT EXISTS sandbox;
-- Example experimental table
CREATE TABLE sandbox.my_experiment_data (
    id SERIAL PRIMARY KEY,
    created_at TIMESTAMP DEFAULT NOW(),
    data_source TEXT,
    raw_data JSONB,
    processed_result NUMERIC,
    notes TEXT
);
-- Grant access to your user
GRANT ALL ON SCHEMA sandbox TO your_user;
GRANT ALL ON ALL TABLES IN SCHEMA sandbox TO your_user;
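Note that `supabase.table(...)` targets the `public` schema by default. In recent versions of supabase-py, one way to reach the `sandbox` schema is the client's schema selector; this is a sketch, and it assumes `sandbox` has been added to the exposed schemas in your Supabase API settings:

```python
# Query a table in the sandbox schema instead of public
# (assumes "sandbox" is listed under exposed schemas in Supabase API settings)
response = (
    supabase.schema("sandbox")
    .table("my_experiment_data")
    .select("*")
    .execute()
)
```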
Step 6: Railway Deployment¶
- Connect Repository: Link your Git repo to Railway
- Set Environment Variables: Add SUPABASE_URL and SUPABASE_ANON_KEY through the Railway dashboard (see the example below)
- Deploy: A push to Git triggers an automatic deployment
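The app template reads two variables; the values below are placeholders to replace with your own project's credentials:

```
SUPABASE_URL=https://<your-project>.supabase.co
SUPABASE_ANON_KEY=<your-anon-key>
```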
Data Access Patterns¶
Reading Data¶
# Simple query
response = supabase.table('sandbox_data').select("*").execute()
data = response.data
# Filtered query
response = supabase.table('sandbox_data').select("*").gte('value', 100).execute()
# Join with core data (read-only)
response = supabase.rpc('get_experiment_data', {'param': value}).execute()
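The `rpc` call assumes a corresponding Postgres function exists on the Supabase side. A hypothetical sketch of what `get_experiment_data` might look like (the signature and body here are illustrative only, not part of the template):

```sql
-- Hypothetical read-only helper joining experiment data for RPC access
CREATE OR REPLACE FUNCTION get_experiment_data(param INTEGER)
RETURNS TABLE (id INTEGER, processed_result NUMERIC, notes TEXT) AS $$
  SELECT d.id, d.processed_result, d.notes
  FROM sandbox.my_experiment_data d
  WHERE d.id >= param;
$$ LANGUAGE sql STABLE;
```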
Writing Data¶
# Insert experimental results
result = supabase.table('sandbox_results').insert({
'experiment_name': 'my-experiment',
'result_data': {'accuracy': 0.85, 'precision': 0.82},
'created_by': 'data_scientist_name'
}).execute()
Best Practices for Phase 1¶
Code Organization¶
my-experiment/
├── app.py           # Main Streamlit application
├── requirements.txt # Python dependencies
├── railway.toml     # Deployment configuration
├── utils.py         # Helper functions
├── data/            # Local data files (if any)
└── README.md        # Experiment documentation
Performance Tips¶
Caching: Use @st.cache_data for expensive computations
@st.cache_data
def load_and_process_data():
    # Expensive data loading/processing runs once, then serves from cache
    response = supabase.table('sandbox_data').select("*").execute()
    return pd.DataFrame(response.data)
Session State: Maintain state across reruns
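For example, a minimal sketch with a hypothetical counter that persists across reruns:

```python
# Initialize once; subsequent reruns keep the stored value
if "run_count" not in st.session_state:
    st.session_state.run_count = 0

if st.button("Increment"):
    st.session_state.run_count += 1

st.write(f"Button pressed {st.session_state.run_count} times")
```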
Pagination: For large datasets
page_size = 100
page = st.number_input('Page', min_value=1, value=1)
offset = (page - 1) * page_size
response = supabase.table('data').select("*").range(offset, offset + page_size - 1).execute()
Documentation Standards¶
Always include in your README.md:
# Experiment: [Name]
## Hypothesis
What are you testing?
## Data Sources
What data are you using?
## Methodology
How are you testing the hypothesis?
## Results
What did you find?
## Next Steps
Should this graduate to Phase 2?
Common Patterns¶
File Upload and Processing¶
uploaded_file = st.file_uploader("Upload CSV", type=['csv'])
if uploaded_file:
    df = pd.read_csv(uploaded_file)

    # Process and store in sandbox
    processed_data = process_data(df)
    store_results(processed_data)
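`process_data` and `store_results` are left to the experiment. A hypothetical sketch of both, assuming results go to the `sandbox_results` table used elsewhere in this guide:

```python
def process_data(df: pd.DataFrame) -> dict:
    # Illustrative processing: summarize the uploaded file
    return {
        "row_count": len(df),
        "columns": list(df.columns),
    }

def store_results(processed: dict):
    # Persist the summary to the sandbox results table
    supabase.table("sandbox_results").insert({
        "experiment_name": "my-experiment",
        "result_data": processed,
    }).execute()
```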
External API Integration¶
import requests

@st.cache_data(ttl=3600)  # Cache for 1 hour
def fetch_external_data(symbol):
    # Call external API
    response = requests.get(f"https://api.example.com/data/{symbol}")
    response.raise_for_status()
    return response.json()
Interactive Filtering¶
# Sidebar controls
st.sidebar.header("Filters")
date_range = st.sidebar.date_input("Date Range", value=[start_date, end_date])
category = st.sidebar.selectbox("Category", options=['A', 'B', 'C'])

# Apply filters
filtered_data = data[
    (data['date'] >= date_range[0]) &
    (data['date'] <= date_range[1]) &
    (data['category'] == category)
]
Troubleshooting¶
Common Issues¶
Database Connection Errors
- Verify environment variables are set correctly
- Check Supabase URL format and API key permissions
- Ensure sandbox schema exists and has proper grants
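A quick sanity check you can drop into the app to confirm the variables are visible (it reports presence only, never the secret values):

```python
import os
import streamlit as st

# Confirm the variables are set without exposing their values
st.write("SUPABASE_URL set:", bool(os.getenv("SUPABASE_URL")))
st.write("SUPABASE_ANON_KEY set:", bool(os.getenv("SUPABASE_ANON_KEY")))
```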
Railway Deployment Failures
- Check requirements.txt for version conflicts
- Verify railway.toml syntax
- Review Railway build logs for specific errors
Streamlit Performance Issues
- Add caching to expensive operations
- Use pagination for large datasets
- Consider data sampling for initial exploration
Getting Help¶
- Check Railway logs: `railway logs` command
- Review Streamlit documentation: common UI patterns
- Test locally first: `streamlit run app.py`
- Ask in team Slack: share the Railway URL for quick debugging
Success Criteria¶
Your Phase 1 experiment is successful when:
- ✅ Working application: Deployed and accessible via Railway URL
- ✅ User feedback: At least 3 people have used it and provided input
- ✅ Clear results: Hypothesis is proven or disproven with data
- ✅ Documentation: Results and next steps are documented
- ✅ Decision made: Clear recommendation for graduation or archival
Graduation Checklist¶
Before moving to Phase 2, ensure:
- Regular usage by target audience (>5 users tested)
- Stable data requirements identified
- Business value clearly demonstrated
- Basic error handling implemented
- Performance is acceptable for intended use
- Security considerations reviewed
- Data contracts defined
Example: MarginIQ Phase 1¶
Week 1: Built basic PDF upload and OCR extraction
- Single page Streamlit app
- Upload PDF → extract text → display tables
- Stored raw results in sandbox.margin_extractions
- 3 users tested with sample Goldman Sachs reports
Results: Proved OCR could extract margin data accurately
Decision: Graduate to Phase 2 for production use
This Phase 1 took 5 days and validated the core technical approach before investing in production features.