LOBSTER Sample Data - AMZN Limit Order Book EDA
This notebook explores the LOBSTER sample dataset for Amazon (AMZN) on June 21, 2012: a single full trading day reconstructed at 10 levels of depth from NASDAQ's Historical TotalView-ITCH feed.
LOBSTER (Limit Order Book System - The Efficient Reconstructor) is produced by the University of Vienna and provides tick-level limit order book snapshots and the message stream that drives each state change.
Data source: LOBSTER Sample Files - data.lobsterdata.com/info/DataSamples.php
Download the AMZN Level 10 zip (LOBSTER_SampleFile_AMZN_2012-06-21_10.zip), extract it, and place both CSV files in a data/ directory next to this notebook (the path that DATA_DIR points to in the setup code) before running.
The two files in the zip:
AMZN...message_10.csv: the message stream (6 columns, no header)
AMZN...orderbook_10.csv: the order book snapshots (40 columns, no header)
| Parameter | Value |
|---|---|
| Ticker | AMZN |
| Date | 2012-06-21 |
| Session | 09:30:00 – 16:00:00 ET |
| LOB levels | 10 (bid and ask) |
| Price encoding | Dollar price × 10,000 (integer) |
| Time encoding | Seconds after midnight (float) |
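As a quick illustration of the two encodings in the table, the sketch below converts a raw integer price and a seconds-after-midnight value into a dollar price and a wall-clock time. The helper names `decode_price` and `decode_time` are hypothetical and not part of LOBSTER. The sample values correspond to the first message row previewed later in this notebook.

```python
from datetime import datetime, timedelta

PRICE_SCALE = 10_000  # LOBSTER stores dollar price x 10,000 as an integer


def decode_price(raw_price: int) -> float:
    """Convert a LOBSTER integer price to dollars (hypothetical helper)."""
    return raw_price / PRICE_SCALE


def decode_time(seconds_after_midnight: float, trading_date: str) -> datetime:
    """Convert seconds after midnight to a naive datetime on the trading date."""
    midnight = datetime.strptime(trading_date, '%Y-%m-%d')
    return midnight + timedelta(seconds=seconds_after_midnight)


price = decode_price(2_238_200)
stamp = decode_time(34_200.017459617, '2012-06-21')
print(f'${price:.2f} at {stamp.strftime("%H:%M:%S")}')
```

Note that `datetime` only resolves microseconds, so the nanosecond digits of the LOBSTER timestamp are rounded here. The pandas conversion used in the loading code keeps them.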
Related papers
LOBSTER: Limit Order Book Reconstruction System
Huang, R. and Polak, T.
SSRN Working Paper, 2011

DeepLOB: Deep Learning for Limit Order Books
Zhang, Z., Zohren, S., and Roberts, S.
IEEE Transactions on Signal Processing, 2019

Benchmark Dataset for Mid-Price Forecasting of Limit Order Book Data with Machine Learning Methods
Ntakaris, A., Magris, M., Kanniainen, J., Gabbouj, M., and Iosifidis, A.
Journal of Forecasting, 2018
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import matplotlib.ticker as mticker
import warnings
warnings.filterwarnings('ignore')
from pathlib import Path
from IPython.display import display, HTML
FL_BLUE = '#2563eb'
FL_SLATE = '#64748b'
FL_AMBER = '#f59e0b'
FL_GREEN = '#16a34a'
FL_RED = '#ef4444'
FL_BG = '#ffffff'
FL_GRID = '#e2e8f0'
FL_TEXT = '#0f172a'
FL_TEXT2 = '#334155'
FL_BORDER = '#e2e8f0'
matplotlib.rcParams.update({
'figure.facecolor': FL_BG,
'axes.facecolor': FL_BG,
'axes.edgecolor': FL_BORDER,
'axes.labelcolor': FL_TEXT2,
'axes.spines.top': False,
'axes.spines.right': False,
'axes.grid': True,
'grid.color': FL_GRID,
'grid.linewidth': 0.7,
'xtick.color': FL_TEXT2,
'ytick.color': FL_TEXT2,
'xtick.labelsize': 10,
'ytick.labelsize': 10,
'axes.labelsize': 11,
'axes.titlesize': 12,
'axes.titlecolor': FL_TEXT,
'axes.titlepad': 12,
'legend.frameon': False,
'legend.fontsize': 10,
'figure.dpi': 300,
'savefig.bbox': 'tight',
'font.family': 'sans-serif',
'font.sans-serif': ['Inter', 'Helvetica Neue', 'Arial', 'DejaVu Sans'],
})
DATA_DIR = Path('./data')
PREFIX = 'AMZN_2012-06-21_34200000_57600000'
MSG_FILE = DATA_DIR / f'{PREFIX}_message_10.csv'
BOOK_FILE = DATA_DIR / f'{PREFIX}_orderbook_10.csv'
N_LEVELS = 10
PRICE_SCALE = 10_000 # prices stored as integer × 10000
SESSION_START = 34_200 # 09:30:00 in seconds after midnight
SESSION_END = 57_600 # 16:00:00 in seconds after midnight
# print(f'Message file exists: {MSG_FILE.exists()}')
# print(f'Order book file exists: {BOOK_FILE.exists()}')
Loading the data
Neither file has a header row. Columns are assigned manually based on the LOBSTER output specification.
Message file (6 columns per row):
Time, EventType, OrderID, Size, Price, Direction
Order book file (4 × N_LEVELS columns, alternating ask/bid per level):
AskPrice1, AskSize1, BidPrice1, BidSize1, AskPrice2, AskSize2, ...
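To make the interleaved column layout concrete, here is a minimal sketch that reshapes one raw order book row into a per-level table. The row values are made up for illustration, and only three levels are used instead of ten to keep the example short.

```python
import numpy as np
import pandas as pd

EXAMPLE_LEVELS = 3  # shortened from 10 to keep the example small

# One made-up raw row: ask price, ask size, bid price, bid size for each level
raw_row = np.array([
    2239500, 100, 2238100, 21,    # level 1
    2239600, 20, 2237500, 100,    # level 2
    2240000, 13, 2237000, 50,     # level 3
])

# Each level occupies four consecutive columns, so reshape to (levels, 4)
level_table = pd.DataFrame(
    raw_row.reshape(EXAMPLE_LEVELS, 4),
    columns=['AskPrice', 'AskSize', 'BidPrice', 'BidSize'],
    index=pd.RangeIndex(1, EXAMPLE_LEVELS + 1, name='Level'),
)
print(level_table)
```

Reading a row this way makes it easy to check that ask prices rise and bid prices fall as the level number increases.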
msg = pd.read_csv(
MSG_FILE,
header=None,
names=['Time', 'EventType', 'OrderID', 'Size', 'Price', 'Direction']
)
msg['Time'] = msg['Time'].astype(float)
msg['Price'] = msg['Price'].astype(float) / PRICE_SCALE
msg['Size'] = msg['Size'].astype(int)
msg['EventType'] = msg['EventType'].astype(int)
msg['Direction'] = msg['Direction'].astype(int)
msg['Timestamp'] = pd.to_datetime('2012-06-21') + pd.to_timedelta(msg['Time'], unit='s')
book_cols = []
for lvl in range(1, N_LEVELS + 1):
    book_cols += [f'AskPrice{lvl}', f'AskSize{lvl}', f'BidPrice{lvl}', f'BidSize{lvl}']
book = pd.read_csv(BOOK_FILE, header=None, names=book_cols)
price_cols = [c for c in book.columns if 'Price' in c]
book[price_cols] = book[price_cols].astype(float) / PRICE_SCALE
book['Timestamp'] = msg['Timestamp'].values
book['Time'] = msg['Time'].values
# Unoccupied levels carry dummy prices (very large positive for asks, very large
# negative for bids); after scaling they fall outside any real price range.
for lvl in range(1, N_LEVELS + 1):
    ask_col, bid_col = f'AskPrice{lvl}', f'BidPrice{lvl}'
    book.loc[(book[ask_col] > 99999) | (book[ask_col] < 0), ask_col] = np.nan
    book.loc[book[bid_col] < 0, bid_col] = np.nan
print('LOBSTER sample loaded')
print(f'Message rows: {len(msg):,}')
print(f'Book rows: {len(book):,}')
print(f'Time range: {msg["Timestamp"].min()} to {msg["Timestamp"].max()}')
msg_cols_df = pd.DataFrame({
'message_column': msg.columns,
'dtype': [str(msg[c].dtype) for c in msg.columns]
})
book_cols_df = pd.DataFrame({
'book_column': book.columns,
'dtype': [str(book[c].dtype) for c in book.columns]
})
print('\nMessage columns')
display(msg_cols_df.head())
print('Book columns')
display(book_cols_df.head())
print('Message preview')
display(msg.head())
print('Book preview')
display(book[['AskPrice1','AskSize1','BidPrice1','BidSize1']].head())
LOBSTER sample loaded
Message rows: 269,748
Book rows: 269,748
Time range: 2012-06-21 09:30:00.017459617 to 2012-06-21 15:59:59.959359650

Message columns
| | message_column | dtype |
|---|---|---|
| 0 | Time | float64 |
| 1 | EventType | int64 |
| 2 | OrderID | int64 |
| 3 | Size | int64 |
| 4 | Price | float64 |
Book columns
| | book_column | dtype |
|---|---|---|
| 0 | AskPrice1 | float64 |
| 1 | AskSize1 | int64 |
| 2 | BidPrice1 | float64 |
| 3 | BidSize1 | int64 |
| 4 | AskPrice2 | float64 |
Message preview
| | Time | EventType | OrderID | Size | Price | Direction | Timestamp |
|---|---|---|---|---|---|---|---|
| 0 | 34200.017460 | 5 | 0 | 1 | 223.82 | -1 | 2012-06-21 09:30:00.017459617 |
| 1 | 34200.189608 | 1 | 11885113 | 21 | 223.81 | 1 | 2012-06-21 09:30:00.189607670 |
| 2 | 34200.189608 | 1 | 3911376 | 20 | 223.96 | -1 | 2012-06-21 09:30:00.189607670 |
| 3 | 34200.189608 | 1 | 11534792 | 100 | 223.75 | 1 | 2012-06-21 09:30:00.189607670 |
| 4 | 34200.189608 | 1 | 1365373 | 13 | 224.00 | -1 | 2012-06-21 09:30:00.189607670 |
Book preview
| | AskPrice1 | AskSize1 | BidPrice1 | BidSize1 |
|---|---|---|---|---|
| 0 | 223.95 | 100 | 223.18 | 100 |
| 1 | 223.95 | 100 | 223.81 | 21 |
| 2 | 223.95 | 100 | 223.81 | 21 |
| 3 | 223.95 | 100 | 223.81 | 21 |
| 4 | 223.95 | 100 | 223.81 | 21 |
Message stream overview
The message file records every event that modifies the order book state. Each row describes one limit order action with its timestamp, type, size, price, and direction.
Event types: 1 = New limit order submission | 2 = Partial cancellation | 3 = Full deletion | 4 = Visible execution | 5 = Hidden execution | 6 = Cross trade (auction) | 7 = Trading halt
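To show how these event types drive the book state, here is a toy sketch that applies a handful of made-up messages to a price-to-volume map. It assumes Size is the quantity added or removed by each event and ignores queue priority and order IDs, so it is an illustration rather than a LOBSTER reconstruction. The names `apply_event` and `toy_book` are hypothetical.

```python
from collections import defaultdict


def apply_event(toy_book, event_type, size, price, direction):
    """Update visible resting volume for one message (toy model)."""
    side = 'bid' if direction == 1 else 'ask'
    if event_type == 1:              # new limit order adds volume
        toy_book[side][price] += size
    elif event_type in (2, 3, 4):    # cancel, delete, visible execution remove volume
        toy_book[side][price] -= size
        if toy_book[side][price] <= 0:
            del toy_book[side][price]
    # type 5 (hidden execution) leaves the visible book unchanged
    return toy_book


toy_book = {'bid': defaultdict(int), 'ask': defaultdict(int)}
messages = [
    (1, 100, 2238100, 1),    # buy limit order for 100 shares
    (1, 50, 2239500, -1),    # sell limit order for 50 shares
    (2, 40, 2238100, 1),     # partial cancel of 40 shares on the bid
    (4, 50, 2239500, -1),    # visible execution fills the resting sell order
    (5, 10, 2238800, -1),    # hidden execution, visible book untouched
]
for event_type, size, price, direction in messages:
    apply_event(toy_book, event_type, size, price, direction)

print(dict(toy_book['bid']), dict(toy_book['ask']))
```

After these five messages, 60 shares remain on the bid and the ask side is empty.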
event_labels = {
1: 'New submission',
2: 'Partial cancel',
3: 'Full deletion',
4: 'Visible execution',
5: 'Hidden execution',
7: 'Trading halt',
}
event_counts = msg['EventType'].value_counts().sort_index()
event_counts.index = [event_labels.get(i, str(i)) for i in event_counts.index]
direction_counts = msg.loc[msg['Direction'].isin([1, -1]), 'Direction'].value_counts()
direction_counts.index = [
'Buy limit order' if i == 1 else 'Sell limit order'
for i in direction_counts.index
]
plt.figure(figsize=(8, 4.5))
bars = plt.barh(
event_counts.index[::-1],
event_counts.values[::-1],
color=FL_BLUE,
height=0.55
)
for bar, val in zip(bars, event_counts.values[::-1]):
    plt.text(
        bar.get_width() + max(event_counts.values) * 0.01,
        bar.get_y() + bar.get_height() / 2,
        f'{val:,}',
        va='center',
        fontsize=9,
        color=FL_TEXT2
    )
plt.xlabel('Event count')
plt.title('Message event type distribution')
plt.tick_params(axis='both', length=0)
plt.gca().xaxis.set_major_formatter(
mticker.FuncFormatter(lambda x, _: f'{x/1000:.0f}k' if x >= 1000 else f'{x:.0f}')
)
plt.tight_layout()
plt.show()
plt.figure(figsize=(8, 4.5))
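# value_counts orders by frequency, so pick each bar's color from its label
direction_colors = [
    FL_GREEN if label.startswith('Buy') else FL_RED
    for label in direction_counts.index
]
bars2 = plt.bar(
    direction_counts.index,
    direction_counts.values,
    color=direction_colors,
    width=0.4
)
for bar, val in zip(bars2, direction_counts.values):
    plt.text(
        bar.get_x() + bar.get_width() / 2,
        bar.get_height() + max(direction_counts.values) * 0.01,
        f'{val:,}',
        ha='center',
        fontsize=9,
        color=FL_TEXT2
    )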
plt.ylabel('Order count')
plt.title('Order direction distribution')
plt.tick_params(axis='both', length=0)
plt.gca().yaxis.set_major_formatter(
mticker.FuncFormatter(lambda y, _: f'{y/1000:.0f}k' if y >= 1000 else f'{y:.0f}')
)
plt.tight_layout()
plt.show()
summary_df = pd.DataFrame([
{
'Metric': 'Total events',
'Value': f'{len(msg):,}'
},
{
'Metric': 'Unique order IDs',
'Value': f'{msg["OrderID"].nunique():,}'
}
])
event_summary_df = pd.DataFrame({
'Event type': event_counts.index,
'Count': event_counts.values,
'Share': [f'{v / len(msg):.1%}' for v in event_counts.values]
})
display(summary_df)
display(event_summary_df)
| | Metric | Value |
|---|---|---|
| 0 | Total events | 269,748 |
| 1 | Unique order IDs | 138,294 |
| | Event type | Count | Share |
|---|---|---|---|
| 0 | New submission | 131954 | 48.9% |
| 1 | Partial cancel | 2917 | 1.1% |
| 2 | Full deletion | 123458 | 45.8% |
| 3 | Visible execution | 8974 | 3.3% |
| 4 | Hidden execution | 2445 | 0.9% |
Order activity over the trading session
Message arrival rate bucketed into 5-minute intervals across the full trading session (09:30 - 16:00 ET). The first and last 30 minutes typically show elevated activity due to the opening and closing auctions.
msg_session = msg[(msg['Time'] >= SESSION_START) & (msg['Time'] <= SESSION_END)].copy()
msg_session['MinuteBin'] = msg_session['Timestamp'].dt.floor('5min')
activity = msg_session.groupby('MinuteBin')['EventType'].count().rename('count')
trades = (
msg_session[msg_session['EventType'].isin([4, 5])]
.groupby('MinuteBin')['EventType']
.count()
.rename('trades')
)
plt.figure(figsize=(8, 4.5))
plt.fill_between(activity.index, activity.values, alpha=0.15, color=FL_BLUE)
plt.plot(activity.index, activity.values, color=FL_BLUE, linewidth=1.4)
plt.ylabel('Messages / 5 min')
plt.title('Order book message rate over the trading session')
plt.tick_params(axis='both', length=0)
plt.gca().yaxis.set_major_formatter(
mticker.FuncFormatter(lambda y, _: f'{y/1000:.0f}k' if y >= 1000 else f'{y:.0f}')
)
plt.tight_layout()
plt.show()
plt.figure(figsize=(8, 4.5))
plt.fill_between(trades.index, trades.values, alpha=0.15, color=FL_AMBER)
plt.plot(trades.index, trades.values, color=FL_AMBER, linewidth=1.4)
plt.ylabel('Executions / 5 min')
plt.title('Visible and hidden executions over the trading session')
plt.gca().xaxis.set_major_formatter(mdates.DateFormatter('%H:%M'))
plt.gca().xaxis.set_major_locator(mdates.HourLocator())
plt.tick_params(axis='both', length=0)
plt.tight_layout()
plt.show()
Mid-price and bid-ask spread
The mid-price is the average of the best bid and best ask at each order book snapshot. It is the standard reference price for LOB research and forms the target variable in mid-price movement prediction tasks.
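The bid-ask spread (best ask minus best bid) is the cost of an immediate round trip against the resting limit order book: buying at the ask and selling at the bid. The code in this notebook reports the spread in units of $0.0001 (the LOBSTER price resolution) and labels these units ticks. The minimum quoted price increment for AMZN is $0.01, so 100 of these units equal one cent.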
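The sketch below works through both definitions in LOBSTER integer price units, using the best bid and ask from the 10:00:00 snapshot reported later in this notebook. The constant `TICK_UNITS` is an assumption: it encodes the $0.01 minimum price increment for US stocks priced above $1 as 100 integer units.

```python
PRICE_SCALE = 10_000   # LOBSTER price units per dollar
TICK_UNITS = 100       # assumption: $0.01 minimum price increment = 100 units

best_bid_raw = 2_241_700   # $224.17
best_ask_raw = 2_243_500   # $224.35

spread_raw = best_ask_raw - best_bid_raw
mid_raw = (best_ask_raw + best_bid_raw) / 2

print(f'Mid-price: ${mid_raw / PRICE_SCALE:.4f}')
print(f'Spread:    ${spread_raw / PRICE_SCALE:.4f}')
print(f'Spread in $0.0001 units: {spread_raw}')
print(f'Spread in $0.01 increments: {spread_raw // TICK_UNITS}')
```

Working in integer units until the final formatting step avoids floating-point noise when comparing or counting spread values.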
book['MidPrice'] = (book['AskPrice1'] + book['BidPrice1']) / 2
book['Spread'] = book['AskPrice1'] - book['BidPrice1']
book_s = book[(book['Time'] >= SESSION_START) & (book['Time'] <= SESSION_END)].copy()
book_s = book_s.dropna(subset=['MidPrice', 'Spread'])
book_s = book_s[book_s['Spread'] > 0]
sample = book_s.iloc[::50]
plt.figure(figsize=(8, 4.5))
plt.plot(
sample['Timestamp'],
sample['MidPrice'],
color=FL_BLUE,
linewidth=1.2,
label='Mid-price'
)
plt.ylabel('Price ($)')
plt.title('AMZN mid-price on 2012-06-21')
plt.tick_params(axis='both', length=0)
plt.gca().yaxis.set_major_formatter(
mticker.FuncFormatter(lambda y, _: f'${y:.2f}')
)
plt.tight_layout()
plt.show()
plt.figure(figsize=(8, 4.5))
plt.plot(
sample['Timestamp'],
sample['Spread'] * 10000,
color=FL_AMBER,
linewidth=0.9,
alpha=0.8
)
plt.ylabel('Spread (ticks, 1 tick = $0.0001)')
plt.title('Bid ask spread in ticks')
plt.gca().xaxis.set_major_formatter(mdates.DateFormatter('%H:%M'))
plt.gca().xaxis.set_major_locator(mdates.HourLocator())
plt.tick_params(axis='both', length=0)
plt.tight_layout()
plt.show()
spread_summary = pd.DataFrame([
{
'Metric': 'Mid-price range',
'Value': f'${book_s["MidPrice"].min():.2f} to ${book_s["MidPrice"].max():.2f}'
},
{
'Metric': 'Median spread',
'Value': f'{book_s["Spread"].median() * 10000:.1f} ticks (${book_s["Spread"].median():.4f})'
},
{
'Metric': 'Mean spread',
'Value': f'{book_s["Spread"].mean() * 10000:.1f} ticks (${book_s["Spread"].mean():.4f})'
},
{
'Metric': 'Spread std',
'Value': f'{book_s["Spread"].std() * 10000:.2f} ticks'
}
])
display(spread_summary)
| | Metric | Value |
|---|---|---|
| 0 | Mid-price range | $220.52 to $226.03 |
| 1 | Median spread | 1300.0 ticks ($0.1300) |
| 2 | Mean spread | 1308.6 ticks ($0.1309) |
| 3 | Spread std | 567.88 ticks |
Order book depth profile
Average cumulative volume available at each price level across the full trading session. Deeper levels accumulate more resting liquidity and the shape of this profile reflects how much volume market participants are willing to quote away from the best price.
ask_sizes = [book_s[f'AskSize{lvl}'].mean() for lvl in range(1, N_LEVELS + 1)]
bid_sizes = [book_s[f'BidSize{lvl}'].mean() for lvl in range(1, N_LEVELS + 1)]
cum_ask = np.cumsum(ask_sizes)
cum_bid = np.cumsum(bid_sizes)
levels = np.arange(1, N_LEVELS + 1)
width = 0.35
plt.figure(figsize=(8, 4.5))
plt.bar(
levels - width / 2,
bid_sizes,
width=width,
color=FL_GREEN,
alpha=0.85,
label='Bid'
)
plt.bar(
levels + width / 2,
ask_sizes,
width=width,
color=FL_RED,
alpha=0.85,
label='Ask'
)
plt.xlabel('Level')
plt.ylabel('Average volume (shares)')
plt.title('Average volume per level for bid vs ask')
plt.xticks(levels)
plt.legend()
plt.tick_params(axis='both', length=0)
plt.tight_layout()
plt.show()
plt.figure(figsize=(8, 4.5))
plt.fill_between(levels, cum_bid, alpha=0.15, color=FL_GREEN)
plt.plot(
levels,
cum_bid,
color=FL_GREEN,
marker='o',
markersize=5,
linewidth=1.6,
label='Bid cumulative'
)
plt.fill_between(levels, cum_ask, alpha=0.15, color=FL_RED)
plt.plot(
levels,
cum_ask,
color=FL_RED,
marker='o',
markersize=5,
linewidth=1.6,
label='Ask cumulative'
)
plt.xlabel('Level')
plt.ylabel('Cumulative volume (shares)')
plt.title('Cumulative depth for bid vs ask across 10 levels')
plt.xticks(levels)
plt.legend()
plt.tick_params(axis='both', length=0)
plt.tight_layout()
plt.show()
level_volume_df = pd.DataFrame({
'Level': levels,
'Bid avg': bid_sizes,
'Ask avg': ask_sizes
})
level_volume_df[['Bid avg', 'Ask avg']] = (
level_volume_df[['Bid avg', 'Ask avg']].round(0).astype(int)
)
display(level_volume_df)
| | Level | Bid avg | Ask avg |
|---|---|---|---|
| 0 | 1 | 245 | 139 |
| 1 | 2 | 139 | 106 |
| 2 | 3 | 148 | 99 |
| 3 | 4 | 152 | 113 |
| 4 | 5 | 166 | 135 |
| 5 | 6 | 186 | 150 |
| 6 | 7 | 205 | 169 |
| 7 | 8 | 213 | 183 |
| 8 | 9 | 218 | 190 |
| 9 | 10 | 212 | 194 |
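As a worked example of the cumulative profile, the sketch below applies a running sum to the rounded per-level averages from the table above. Because the inputs are rounded, the totals may differ slightly from the unrounded values plotted in the cumulative depth chart.

```python
import numpy as np

# Rounded average shares per level, copied from the table above (levels 1 to 10)
bid_avg = np.array([245, 139, 148, 152, 166, 186, 205, 213, 218, 212])
ask_avg = np.array([139, 106, 99, 113, 135, 150, 169, 183, 190, 194])

# Cumulative depth at level k is the sum of average volume over levels 1..k
cum_bid_example = np.cumsum(bid_avg)
cum_ask_example = np.cumsum(ask_avg)

print('Cumulative bid depth:', cum_bid_example)
print('Cumulative ask depth:', cum_ask_example)
```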
Limit order book snapshot
A single LOB snapshot shows the resting volume available at each price level at one instant in time. The chart below renders the book state at 10:00:00 ET.
Each horizontal bar represents one price level. Green bars = bid side (buyers). Red bars = ask side (sellers). The gap between the innermost bid and ask bars is the bid-ask spread.
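Picking the row for a target time involves a small choice: the first event at or after the target, or the last event at or before it. The following sketch shows both on made-up event times using `np.searchsorted`, which assumes the times are sorted ascending as they are in the message file.

```python
import numpy as np

# Made-up event times in seconds after midnight, sorted ascending
event_times = np.array([35_999.20, 35_999.85, 36_000.40, 36_001.10])
target = 36_000.0  # 10:00:00

# First event at or after the target (the selection used by the snapshot code below)
first_after = int(np.searchsorted(event_times, target, side='left'))

# Last event at or before the target (the book state already in effect at the target)
last_before = int(np.searchsorted(event_times, target, side='right')) - 1

print(first_after, last_before)
```

The two choices differ by at most one event, which rarely matters for a chart but can matter when aligning the book with external data.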
TARGET_SEC = 36_000
snap_idx = book_s[book_s['Time'] >= TARGET_SEC].index[0]
snap = book.loc[snap_idx]
bid_prices = [snap[f'BidPrice{l}'] for l in range(1, N_LEVELS + 1)]
bid_sizes = [snap[f'BidSize{l}'] for l in range(1, N_LEVELS + 1)]
ask_prices = [snap[f'AskPrice{l}'] for l in range(1, N_LEVELS + 1)]
ask_sizes = [snap[f'AskSize{l}'] for l in range(1, N_LEVELS + 1)]
bid_data = [(p, s) for p, s in zip(bid_prices, bid_sizes) if pd.notna(p) and pd.notna(s) and s > 0]
ask_data = [(p, s) for p, s in zip(ask_prices, ask_sizes) if pd.notna(p) and pd.notna(s) and s > 0]
snap_time = snap['Timestamp'].strftime('%H:%M:%S') if hasattr(snap['Timestamp'], 'strftime') else '10:00:xx'
plt.figure(figsize=(8, 4.5))
if bid_data:
    bp, bs = zip(*bid_data)
    bid_height = min(0.005, (max(bp) - min(bp)) / (len(bp) * 1.5) if len(bp) > 1 else 0.005)
    plt.barh(
        bp,
        [-s for s in bs],
        left=0,
        color=FL_GREEN,
        alpha=0.8,
        height=bid_height,
        label='Bid'
    )
if ask_data:
    ap, as_ = zip(*ask_data)
    ask_height = min(0.005, (max(ap) - min(ap)) / (len(ap) * 1.5) if len(ap) > 1 else 0.005)
    plt.barh(
        ap,
        as_,
        left=0,
        color=FL_RED,
        alpha=0.8,
        height=ask_height,
        label='Ask'
    )
plt.axvline(0, color=FL_BORDER, linewidth=1)
plt.xlabel('Volume (shares) | left = bid, right = ask')
plt.ylabel('Price ($)')
plt.title(f'AMZN limit order book snapshot at {snap_time} ET (10 levels)')
plt.gca().yaxis.set_major_formatter(
mticker.FuncFormatter(lambda y, _: f'${y:.2f}')
)
plt.gca().xaxis.set_major_formatter(
mticker.FuncFormatter(lambda x, _: f'{abs(x):,.0f}')
)
plt.legend(loc='lower right')
plt.tick_params(axis='both', length=0)
plt.tight_layout()
plt.show()
if bid_data and ask_data:
    mid = (bid_data[0][0] + ask_data[0][0]) / 2
    spread = ask_data[0][0] - bid_data[0][0]
    snapshot_df = pd.DataFrame([
        {
            'Metric': 'Snapshot time',
            'Value': f'{snap_time} ET'
        },
        {
            'Metric': 'Best bid',
            'Value': f'${bid_data[0][0]:.4f} ({bid_data[0][1]:,} shares)'
        },
        {
            'Metric': 'Best ask',
            'Value': f'${ask_data[0][0]:.4f} ({ask_data[0][1]:,} shares)'
        },
        {
            'Metric': 'Mid-price',
            'Value': f'${mid:.4f}'
        },
        {
            'Metric': 'Spread',
            'Value': f'${spread:.4f}'
        },
        {
            'Metric': 'Spread (ticks)',
            'Value': f'{spread * 10000:.0f}'
        }
    ])
    display(snapshot_df)
| | Metric | Value |
|---|---|---|
| 0 | Snapshot time | 10:00:00 ET |
| 1 | Best bid | $224.1700 (200 shares) |
| 2 | Best ask | $224.3500 (100 shares) |
| 3 | Mid-price | $224.2600 |
| 4 | Spread | $0.1800 |
| 5 | Spread (ticks) | 1800 |
Trade analysis
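Executions (event types 4 and 5) represent actual trades: an incoming marketable order matching a resting limit order. For executions, the Direction field records the side of the resting limit order rather than the aggressor. Direction = -1 means a resting sell order on the ask side was executed, which marks a buy-initiated trade, and Direction = 1 marks a sell-initiated trade.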
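The mapping from Direction to trade initiator is easy to get backwards, so here is a minimal sketch of it. It follows the same convention as the summary code in this section, which counts Direction == -1 as buy-initiated. The function name `aggressor_side` and the sample executions are made up for illustration.

```python
from collections import Counter


def aggressor_side(direction: int) -> str:
    """Map the resting order's direction to the side that initiated the trade."""
    if direction == -1:    # a resting sell limit order was executed
        return 'buy'
    if direction == 1:     # a resting buy limit order was executed
        return 'sell'
    raise ValueError(f'unexpected direction: {direction}')


# Made-up executions: (event type, size, direction of the resting order)
executions = [(4, 100, -1), (4, 50, 1), (5, 20, -1)]
initiator_counts = Counter(aggressor_side(d) for _, _, d in executions)
print(initiator_counts)
```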
trades = msg[msg['EventType'].isin([4, 5])].copy()
trades = trades[(trades['Time'] >= SESSION_START) & (trades['Time'] <= SESSION_END)]
trades['DollarValue'] = trades['Price'] * trades['Size']
trades['Bin'] = trades['Timestamp'].dt.floor('30min')
# Volume-weighted average price per bin: dollar value traded divided by shares traded
by_bin = trades.groupby('Bin')
vwap = (by_bin['DollarValue'].sum() / by_bin['Size'].sum()).rename('VWAP')
vol_bin = trades.groupby('Bin')['Size'].sum().rename('Volume')
plt.figure(figsize=(8, 4.5))
plt.hist(
trades['Size'].clip(upper=2000),
bins=50,
color=FL_BLUE,
alpha=0.8,
edgecolor='none'
)
plt.xlabel('Trade size (shares, clipped at 2,000)')
plt.ylabel('Frequency')
plt.title('Trade size distribution')
plt.tick_params(axis='both', length=0)
plt.tight_layout()
plt.show()
plt.figure(figsize=(8, 4.5))
plt.bar(
vol_bin.index,
vol_bin.values,
width=pd.Timedelta('28min'),
color=FL_SLATE,
alpha=0.85
)
plt.ylabel('Total volume (shares)')
plt.title('Intraday volume in 30 minute buckets')
plt.gca().xaxis.set_major_formatter(mdates.DateFormatter('%H:%M'))
plt.gca().xaxis.set_major_locator(mdates.HourLocator(interval=2))
plt.gca().yaxis.set_major_formatter(
mticker.FuncFormatter(lambda y, _: f'{y/1000:.0f}k' if y >= 1000 else f'{y:.0f}')
)
plt.tick_params(axis='both', length=0)
plt.tight_layout()
plt.show()
plt.figure(figsize=(8, 4.5))
plt.step(
vwap.index,
vwap.values,
color=FL_AMBER,
linewidth=1.6,
where='post'
)
plt.ylabel('VWAP ($)')
plt.title('30 minute VWAP over the session')
plt.gca().xaxis.set_major_formatter(mdates.DateFormatter('%H:%M'))
plt.gca().xaxis.set_major_locator(mdates.HourLocator(interval=2))
plt.gca().yaxis.set_major_formatter(
mticker.FuncFormatter(lambda y, _: f'${y:.2f}')
)
plt.tick_params(axis='both', length=0)
plt.tight_layout()
plt.show()
trade_summary = pd.DataFrame([
{
'Metric': 'Total trades',
'Value': f'{len(trades):,}'
},
{
'Metric': 'Total volume',
'Value': f'{trades["Size"].sum():,} shares'
},
{
'Metric': 'Total dollar volume',
'Value': f'${trades["DollarValue"].sum():,.0f}'
},
{
'Metric': 'Median trade size',
'Value': f'{trades["Size"].median():.0f} shares'
},
{
'Metric': 'Mean trade size',
'Value': f'{trades["Size"].mean():.1f} shares'
},
{
'Metric': 'Buy-initiated trades',
'Value': f'{(trades["Direction"] == -1).sum():,} ({(trades["Direction"] == -1).mean():.1%})'
},
{
'Metric': 'Sell-initiated trades',
'Value': f'{(trades["Direction"] == 1).sum():,} ({(trades["Direction"] == 1).mean():.1%})'
}
])
display(trade_summary)
| | Metric | Value |
|---|---|---|
| 0 | Total trades | 11,419 |
| 1 | Total volume | 810,755 shares |
| 2 | Total dollar volume | $180,502,249 |
| 3 | Median trade size | 66 shares |
| 4 | Mean trade size | 71.0 shares |
| 5 | Buy-initiated trades | 5,904 (51.7%) |
| 6 | Sell-initiated trades | 5,515 (48.3%) |
LOB evolution over a 10-minute window
This section tracks how the best bid, best ask, mid-price, and resting volume at level 1 evolve over a representative 10-minute window (10:00 - 10:10 ET). It illustrates the continuous state changes driven by order submissions, cancellations, and executions.
WIN_START = 36_000
WIN_END = 36_600
win = book_s[(book_s['Time'] >= WIN_START) & (book_s['Time'] < WIN_END)].copy()
plt.figure(figsize=(8, 4.5))
plt.plot(
win['Timestamp'],
win['BidPrice1'],
color=FL_GREEN,
linewidth=1.0,
alpha=0.8,
label='Best bid'
)
plt.plot(
win['Timestamp'],
win['AskPrice1'],
color=FL_RED,
linewidth=1.0,
alpha=0.8,
label='Best ask'
)
plt.plot(
win['Timestamp'],
win['MidPrice'],
color=FL_BLUE,
linewidth=1.4,
label='Mid-price'
)
plt.fill_between(
win['Timestamp'],
win['BidPrice1'],
win['AskPrice1'],
alpha=0.07,
color=FL_SLATE
)
plt.ylabel('Price ($)')
plt.title('Best bid, best ask, and mid-price from 10:00 to 10:10 ET')
plt.gca().yaxis.set_major_formatter(
mticker.FuncFormatter(lambda y, _: f'${y:.2f}')
)
plt.legend(loc='upper left', fontsize=9)
plt.tick_params(axis='both', length=0)
plt.tight_layout()
plt.show()
plt.figure(figsize=(8, 4.5))
plt.fill_between(win['Timestamp'], win['Spread'] * 10000, alpha=0.3, color=FL_AMBER)
plt.plot(win['Timestamp'], win['Spread'] * 10000, color=FL_AMBER, linewidth=1.0)
plt.ylabel('Spread (ticks)')
plt.title('Bid ask spread from 10:00 to 10:10 ET')
plt.tick_params(axis='both', length=0)
plt.tight_layout()
plt.show()
plt.figure(figsize=(8, 4.5))
plt.fill_between(win['Timestamp'], win['BidSize1'], alpha=0.2, color=FL_GREEN, label='Bid L1')
plt.fill_between(win['Timestamp'], win['AskSize1'], alpha=0.2, color=FL_RED, label='Ask L1')
plt.plot(win['Timestamp'], win['BidSize1'], color=FL_GREEN, linewidth=0.8)
plt.plot(win['Timestamp'], win['AskSize1'], color=FL_RED, linewidth=0.8)
plt.ylabel('Volume at L1 (shares)')
plt.title('Level 1 resting volume from 10:00 to 10:10 ET')
plt.gca().xaxis.set_major_formatter(mdates.DateFormatter('%H:%M:%S'))
plt.gca().xaxis.set_major_locator(mdates.MinuteLocator(interval=2))
plt.legend(loc='upper right', fontsize=9)
plt.tick_params(axis='both', length=0)
plt.tight_layout()
plt.show()
Spread and mid-price return distribution
This section looks at the distribution of the spread and of mid-price changes between consecutive events. Most mid-price changes are zero, because most events leave both the best bid and the best ask unchanged. Large moves are rare and occur during periods of elevated volatility.
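A small worked example helps show why zero changes dominate and what sizes the nonzero changes take. The quotes below are made up and expressed in LOBSTER integer price units, assuming displayed prices move in $0.01 steps (100 units), so the mid-price can only move in multiples of 50 units.

```python
import numpy as np

# Made-up best quotes in LOBSTER integer price units (100 units = $0.01)
best_bid = np.array([2241700, 2241700, 2241800, 2241800, 2241700])
best_ask = np.array([2243500, 2243500, 2243500, 2243400, 2243400])

mid_units = (best_bid + best_ask) / 2    # mid-price in integer price units
mid_change = np.diff(mid_units)          # change between consecutive snapshots

share_unchanged = np.mean(mid_change == 0)
print(mid_change)
print(f'Unchanged: {share_unchanged:.0%}')
```

The first event leaves both quotes untouched, so the mid-price change is zero. Each later event moves one side of the book by one cent and shifts the mid-price by half a cent.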
spread_ticks = (book_s['Spread'] * 10000).round().astype(int)
spread_counts = spread_ticks.value_counts().sort_index().head(20)
mid_changes = book_s['MidPrice'].diff().dropna()
mid_changes_ticks = (mid_changes * 10000).round().astype(int)
plt.figure(figsize=(8, 4.5))
plt.bar(
    spread_counts.index,
    spread_counts.values,
    color=FL_BLUE,
    alpha=0.85,
    width=60  # spread values are spaced 100 units apart, so 0.6 would draw hairline bars
)
plt.xlabel('Spread (ticks)')
plt.ylabel('Frequency')
plt.title('Bid ask spread distribution in ticks')
plt.tick_params(axis='both', length=0)
plt.tight_layout()
plt.show()
change_counts = mid_changes_ticks.value_counts().sort_index()
# Displayed prices move in $0.01 steps, so the mid-price moves in multiples of
# 50 units ($0.005). A window of +/- 10 units would keep only the zero bar, so
# use +/- 1,000 units (+/- $0.10) instead.
change_counts = change_counts[(change_counts.index >= -1000) & (change_counts.index <= 1000)]
colors = [FL_GREEN if i > 0 else (FL_RED if i < 0 else FL_SLATE) for i in change_counts.index]
plt.figure(figsize=(8, 4.5))
plt.bar(
    change_counts.index,
    change_counts.values,
    color=colors,
    alpha=0.85,
    width=30
)
plt.xlabel('Mid-price change (ticks, 1 tick = $0.0001)')
plt.ylabel('Frequency')
plt.title('Mid-price change per event within plus or minus 1,000 ticks')
plt.tick_params(axis='both', length=0)
plt.gca().yaxis.set_major_formatter(
mticker.FuncFormatter(lambda y, _: f'{y/1000:.0f}k' if y >= 1000 else f'{y:.0f}')
)
plt.tight_layout()
plt.show()
zero_pct = (mid_changes_ticks == 0).mean()
micro_summary = pd.DataFrame([
{
'Metric': 'Mid-price unchanged',
'Value': f'{zero_pct:.1%}'
},
{
'Metric': 'Up moves',
'Value': f'{(mid_changes_ticks > 0).mean():.1%}'
},
{
'Metric': 'Down moves',
'Value': f'{(mid_changes_ticks < 0).mean():.1%}'
},
{
'Metric': 'Most common spread',
'Value': f'{spread_counts.idxmax()} ticks ({spread_counts.max() / len(book_s):.1%} of snapshots)'
}
])
display(micro_summary)
| | Metric | Value |
|---|---|---|
| 0 | Mid-price unchanged | 89.8% |
| 1 | Up moves | 5.1% |
| 2 | Down moves | 5.1% |
| 3 | Most common spread | 1200 ticks (8.0% of snapshots) |