LOBSTER Sample Data - AMZN Limit Order Book EDA

This notebook explores the LOBSTER sample dataset for Amazon (AMZN) on June 21, 2012: a single full trading day reconstructed at 10 levels of depth from NASDAQ's Historical TotalView-ITCH feed.

LOBSTER (Limit Order Book System - The Efficient Reconstructor) is produced by the University of Vienna and provides tick-level limit order book snapshots and the message stream that drives each state change.

Data source: LOBSTER Sample Files - data.lobsterdata.com/info/DataSamples.php
Download the AMZN Level 10 zip (LOBSTER_SampleFile_AMZN_2012-06-21_10.zip), extract it, and place both CSV files in a data/ folder next to this notebook before running (the loader reads from DATA_DIR = ./data).

The two files in the zip:

  • AMZN...message_10.csv: the message stream (6 columns, no header)
  • AMZN...orderbook_10.csv: the order book snapshots (40 columns, no header)

Parameter       Value
Ticker          AMZN
Date            2012-06-21
Session         09:30:00 – 16:00:00 ET
LOB levels      10 (bid and ask)
Price encoding  Dollar price × 10,000 (integer)
Time encoding   Seconds after midnight (float)
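
As a minimal sketch of the two encodings in the table above, the snippet below decodes one raw price and one raw time. The helper names decode_price and decode_time are hypothetical and are not part of LOBSTER or of this notebook; the input values are illustrative.

```python
import pandas as pd

PRICE_SCALE = 10_000  # raw prices are dollar prices multiplied by 10,000


def decode_price(raw_price: int) -> float:
    """Convert an integer-encoded price to dollars."""
    return raw_price / PRICE_SCALE


def decode_time(seconds_after_midnight: float, date: str = "2012-06-21") -> pd.Timestamp:
    """Convert seconds after midnight to a timestamp on the given date."""
    return pd.Timestamp(date) + pd.to_timedelta(seconds_after_midnight, unit="s")


price = decode_price(2_239_500)   # 223.95 dollars
opening = decode_time(34_200.0)   # 09:30:00 on 2012-06-21
print(price, opening)
```

Keeping prices as integers until the last step avoids floating-point rounding when prices are compared or subtracted.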

Related papers

  • LOBSTER: Limit Order Book Reconstruction System
    Huang, R. and Polak, T.
    SSRN Working Paper, 2011

  • DeepLOB: Deep Learning for Limit Order Books
    Zhang, Z., Zohren, S., and Roberts, S.
    IEEE Transactions on Signal Processing, 2019

  • Benchmark Dataset for Mid-Price Forecasting of Limit Order Book Data with Machine Learning Methods
    Ntakaris, A., Magris, M., Kanniainen, J., Gabbouj, M., and Iosifidis, A.
    Journal of Forecasting, 2018

In [1]:
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import matplotlib.ticker as mticker
import warnings
warnings.filterwarnings('ignore')

from pathlib import Path
from IPython.display import display, HTML

FL_BLUE   = '#2563eb'
FL_SLATE  = '#64748b'
FL_AMBER  = '#f59e0b'
FL_GREEN  = '#16a34a'
FL_RED    = '#ef4444'
FL_BG     = '#ffffff'
FL_GRID   = '#e2e8f0'
FL_TEXT   = '#0f172a'
FL_TEXT2  = '#334155'
FL_BORDER = '#e2e8f0'

matplotlib.rcParams.update({
    'figure.facecolor':  FL_BG,
    'axes.facecolor':    FL_BG,
    'axes.edgecolor':    FL_BORDER,
    'axes.labelcolor':   FL_TEXT2,
    'axes.spines.top':   False,
    'axes.spines.right': False,
    'axes.grid':         True,
    'grid.color':        FL_GRID,
    'grid.linewidth':    0.7,
    'xtick.color':       FL_TEXT2,
    'ytick.color':       FL_TEXT2,
    'xtick.labelsize':   10,
    'ytick.labelsize':   10,
    'axes.labelsize':    11,
    'axes.titlesize':    12,
    'axes.titlecolor':   FL_TEXT,
    'axes.titlepad':     12,
    'legend.frameon':    False,
    'legend.fontsize':   10,
    'figure.dpi':        300,
    'savefig.bbox':      'tight',
    'font.family':       'sans-serif',
    'font.sans-serif':   ['Inter', 'Helvetica Neue', 'Arial', 'DejaVu Sans'],
})

DATA_DIR   = Path('./data')
PREFIX     = 'AMZN_2012-06-21_34200000_57600000'
MSG_FILE   = DATA_DIR / f'{PREFIX}_message_10.csv'
BOOK_FILE  = DATA_DIR / f'{PREFIX}_orderbook_10.csv'
N_LEVELS   = 10
PRICE_SCALE= 10_000   # prices stored as integer × 10000
SESSION_START = 34_200  # 09:30:00 in seconds after midnight
SESSION_END   = 57_600  # 16:00:00 in seconds after midnight

# print(f'Message file exists:   {MSG_FILE.exists()}')
# print(f'Order book file exists: {BOOK_FILE.exists()}')

Loading the data

Neither file has a header row. Columns are assigned manually based on the LOBSTER output specification.

Message file (6 columns per row): Time, EventType, OrderID, Size, Price, Direction

Order book file (4 × N_LEVELS columns, alternating ask/bid per level): AskPrice1, AskSize1, BidPrice1, BidSize1, AskPrice2, AskSize2, ...
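
Because the four fields repeat once per level, each field can also be pulled out of a raw row with a stride of 4. The sketch below assumes the column order stated above and uses a made-up 3-level row; the numbers are illustrative, not taken from the file.

```python
import numpy as np

# Hypothetical raw row for a 3-level book, in the order
# AskPrice1, AskSize1, BidPrice1, BidSize1, AskPrice2, AskSize2, ...
row = np.array([
    2_239_500, 100, 2_231_800, 100,
    2_239_600,  20, 2_231_700,  50,
    2_240_000,  13, 2_230_000,  10,
])

ask_prices = row[0::4]  # every 4th value starting at column 0
ask_sizes = row[1::4]
bid_prices = row[2::4]
bid_sizes = row[3::4]

print(ask_prices.tolist())  # [2239500, 2239600, 2240000]
print(bid_prices.tolist())  # [2231800, 2231700, 2230000]
```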

In [2]:
msg = pd.read_csv(
    MSG_FILE,
    header=None,
    names=['Time', 'EventType', 'OrderID', 'Size', 'Price', 'Direction']
)

msg['Time'] = msg['Time'].astype(float)
msg['Price'] = msg['Price'].astype(float) / PRICE_SCALE
msg['Size'] = msg['Size'].astype(int)
msg['EventType'] = msg['EventType'].astype(int)
msg['Direction'] = msg['Direction'].astype(int)

msg['Timestamp'] = pd.to_datetime('2012-06-21') + pd.to_timedelta(msg['Time'], unit='s')

book_cols = []
for lvl in range(1, N_LEVELS + 1):
    book_cols += [f'AskPrice{lvl}', f'AskSize{lvl}', f'BidPrice{lvl}', f'BidSize{lvl}']

book = pd.read_csv(BOOK_FILE, header=None, names=book_cols)

price_cols = [c for c in book.columns if 'Price' in c]
book[price_cols] = book[price_cols].astype(float) / PRICE_SCALE

book['Timestamp'] = msg['Timestamp'].values
book['Time'] = msg['Time'].values

# Unoccupied levels carry dummy prices (very large on the ask side, very negative
# on the bid side); treat any out-of-range price as missing.
for lvl in range(1, N_LEVELS + 1):
    ask, bid = f'AskPrice{lvl}', f'BidPrice{lvl}'
    book.loc[(book[ask] > 99999) | (book[ask] < 0), ask] = np.nan
    book.loc[book[bid] < 0, bid] = np.nan

print('LOBSTER sample loaded')
print(f'Message rows:   {len(msg):,}')
print(f'Book rows:      {len(book):,}')
print(f'Time range:     {msg["Timestamp"].min()} to {msg["Timestamp"].max()}')

msg_cols_df = pd.DataFrame({
    'message_column': msg.columns,
    'dtype': [str(msg[c].dtype) for c in msg.columns]
})

book_cols_df = pd.DataFrame({
    'book_column': book.columns,
    'dtype': [str(book[c].dtype) for c in book.columns]
})

print('\nMessage columns')
display(msg_cols_df.head())

print('Book columns')
display(book_cols_df.head())

print('Message preview')
display(msg.head())

print('Book preview')
display(book[['AskPrice1','AskSize1','BidPrice1','BidSize1']].head())
LOBSTER sample loaded
Message rows:   269,748
Book rows:      269,748
Time range:     2012-06-21 09:30:00.017459617 to 2012-06-21 15:59:59.959359650

Message columns
   message_column  dtype
0  Time            float64
1  EventType       int64
2  OrderID         int64
3  Size            int64
4  Price           float64
Book columns
   book_column  dtype
0  AskPrice1    float64
1  AskSize1     int64
2  BidPrice1    float64
3  BidSize1     int64
4  AskPrice2    float64
Message preview
   Time          EventType  OrderID   Size  Price   Direction  Timestamp
0  34200.017460  5          0         1     223.82  -1         2012-06-21 09:30:00.017459617
1  34200.189608  1          11885113  21    223.81  1          2012-06-21 09:30:00.189607670
2  34200.189608  1          3911376   20    223.96  -1         2012-06-21 09:30:00.189607670
3  34200.189608  1          11534792  100   223.75  1          2012-06-21 09:30:00.189607670
4  34200.189608  1          1365373   13    224.00  -1         2012-06-21 09:30:00.189607670
Book preview
   AskPrice1  AskSize1  BidPrice1  BidSize1
0  223.95     100       223.18     100
1  223.95     100       223.81     21
2  223.95     100       223.81     21
3  223.95     100       223.81     21
4  223.95     100       223.81     21

Message stream overview

The message file records every event that modifies the order book state. Each row describes one limit order action with its timestamp, type, size, price, and direction.

Event types: 1 = New limit order submission  |  2 = Partial cancellation  |  3 = Full deletion  |  4 = Visible execution  |  5 = Hidden execution  |  6 = Cross trade (auction)  |  7 = Trading halt. Only types 1 to 5 occur in this sample.
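
To make the event semantics concrete, here is a small sketch that replays a few messages and tracks the visible size still resting per order. It assumes that the Size of a type 2 or type 4 row is the quantity removed from the order; the function apply_event, the order IDs, and the events are all hypothetical.

```python
def apply_event(outstanding: dict, event_type: int, order_id: int, size: int) -> None:
    """Update the resting visible size per order ID for one message row."""
    if event_type == 1:
        # New limit order: starts resting with its full size.
        outstanding[order_id] = size
    elif event_type in (2, 4):
        # Partial cancel or visible execution: reduce the resting size.
        remaining = outstanding.get(order_id, 0) - size
        if remaining > 0:
            outstanding[order_id] = remaining
        else:
            outstanding.pop(order_id, None)
    elif event_type == 3:
        # Full deletion: the order leaves the book.
        outstanding.pop(order_id, None)
    # Other types (for example 5, hidden execution) leave visible orders untouched.


book_orders = {}
events = [
    (1, 101, 100),  # submit 100 shares
    (2, 101, 30),   # cancel 30 of them
    (4, 101, 50),   # 50 shares execute
    (1, 102, 40),   # submit a second order
    (3, 102, 40),   # delete it entirely
    (5, 0, 10),     # hidden execution, no visible order affected
]
for event_type, order_id, size in events:
    apply_event(book_orders, event_type, order_id, size)

print(book_orders)  # {101: 20}
```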

In [3]:
event_labels = {
    1: 'New submission',
    2: 'Partial cancel',
    3: 'Full deletion',
    4: 'Visible execution',
    5: 'Hidden execution',
    6: 'Cross trade',
    7: 'Trading halt',
}

event_counts = msg['EventType'].value_counts().sort_index()
event_counts.index = [event_labels.get(i, str(i)) for i in event_counts.index]

direction_counts = msg.loc[msg['Direction'].isin([1, -1]), 'Direction'].value_counts()
direction_counts.index = [
    'Buy limit order' if i == 1 else 'Sell limit order'
    for i in direction_counts.index
]

plt.figure(figsize=(8, 4.5))
bars = plt.barh(
    event_counts.index[::-1],
    event_counts.values[::-1],
    color=FL_BLUE,
    height=0.55
)

for bar, val in zip(bars, event_counts.values[::-1]):
    plt.text(
        bar.get_width() + max(event_counts.values) * 0.01,
        bar.get_y() + bar.get_height() / 2,
        f'{val:,}',
        va='center',
        fontsize=9,
        color=FL_TEXT2
    )

plt.xlabel('Event count')
plt.title('Message event type distribution')
plt.tick_params(axis='both', length=0)
plt.gca().xaxis.set_major_formatter(
    mticker.FuncFormatter(lambda x, _: f'{x/1000:.0f}k' if x >= 1000 else f'{x:.0f}')
)
plt.tight_layout()
plt.show()

plt.figure(figsize=(8, 4.5))
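# value_counts() orders by frequency, so pick each colour from the label
# instead of assuming the buy side comes first.
bars2 = plt.bar(
    direction_counts.index,
    direction_counts.values,
    color=[FL_GREEN if label.startswith('Buy') else FL_RED for label in direction_counts.index],
    width=0.4
)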

for bar, val in zip(bars2, direction_counts.values):
    plt.text(
        bar.get_x() + bar.get_width() / 2,
        bar.get_height() + max(direction_counts.values) * 0.01,
        f'{val:,}',
        ha='center',
        fontsize=9,
        color=FL_TEXT2
    )

plt.ylabel('Order count')
plt.title('Order direction distribution')
plt.tick_params(axis='both', length=0)
plt.gca().yaxis.set_major_formatter(
    mticker.FuncFormatter(lambda y, _: f'{y/1000:.0f}k' if y >= 1000 else f'{y:.0f}')
)
plt.tight_layout()
plt.show()

summary_df = pd.DataFrame([
    {
        'Metric': 'Total events',
        'Value': f'{len(msg):,}'
    },
    {
        'Metric': 'Unique order IDs',
        'Value': f'{msg["OrderID"].nunique():,}'
    }
])

event_summary_df = pd.DataFrame({
    'Event type': event_counts.index,
    'Count': event_counts.values,
    'Share': [f'{v / len(msg):.1%}' for v in event_counts.values]
})

display(summary_df)
display(event_summary_df)
[Figure: Message event type distribution]
[Figure: Order direction distribution]
   Metric            Value
0  Total events      269,748
1  Unique order IDs  138,294
   Event type         Count   Share
0  New submission     131954  48.9%
1  Partial cancel     2917    1.1%
2  Full deletion      123458  45.8%
3  Visible execution  8974    3.3%
4  Hidden execution   2445    0.9%

Order activity over the trading session

Message arrival rate bucketed into 5-minute intervals across the full trading session (09:30 - 16:00 ET). The first and last 30 minutes typically show elevated activity as trading concentrates around the open and the close.

In [4]:
msg_session = msg[(msg['Time'] >= SESSION_START) & (msg['Time'] <= SESSION_END)].copy()
msg_session['MinuteBin'] = msg_session['Timestamp'].dt.floor('5min')

activity = msg_session.groupby('MinuteBin')['EventType'].count().rename('count')
trades = (
    msg_session[msg_session['EventType'].isin([4, 5])]
    .groupby('MinuteBin')['EventType']
    .count()
    .rename('trades')
)

plt.figure(figsize=(8, 4.5))
plt.fill_between(activity.index, activity.values, alpha=0.15, color=FL_BLUE)
plt.plot(activity.index, activity.values, color=FL_BLUE, linewidth=1.4)
plt.ylabel('Messages / 5 min')
plt.title('Order book message rate over the trading session')
plt.tick_params(axis='both', length=0)
plt.gca().yaxis.set_major_formatter(
    mticker.FuncFormatter(lambda y, _: f'{y/1000:.0f}k' if y >= 1000 else f'{y:.0f}')
)
plt.tight_layout()
plt.show()

plt.figure(figsize=(8, 4.5))
plt.fill_between(trades.index, trades.values, alpha=0.15, color=FL_AMBER)
plt.plot(trades.index, trades.values, color=FL_AMBER, linewidth=1.4)
plt.ylabel('Executions / 5 min')
plt.title('Visible and hidden executions over the trading session')
plt.gca().xaxis.set_major_formatter(mdates.DateFormatter('%H:%M'))
plt.gca().xaxis.set_major_locator(mdates.HourLocator())
plt.tick_params(axis='both', length=0)
plt.tight_layout()
plt.show()
[Figure: Order book message rate over the trading session]
[Figure: Visible and hidden executions over the trading session]

Mid-price and bid-ask spread

The mid-price is the average of the best bid and best ask at each order book snapshot. It is the standard reference price for LOB research and forms the target variable in mid-price movement prediction tasks.

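The bid-ask spread (best ask - best bid) is the cost of trading immediately against the resting limit order book: a buyer pays the ask and a seller receives the bid. The charts and tables below express the spread in units of $0.0001, the LOBSTER price unit, and label those units "ticks". The minimum quoting increment for a stock at this price level is $0.01, so a reading of 1300 ticks corresponds to a 13-cent spread.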
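
As a worked example, the sketch below computes both quantities for the best quotes reported in this notebook's 10:00:00 snapshot (bid $224.17, ask $224.35). It works on the integer-encoded prices; UNITS_PER_CENT is an illustrative constant that assumes a $0.01 quoting increment.

```python
PRICE_SCALE = 10_000   # raw price units per dollar
UNITS_PER_CENT = 100   # assumption: quotes sit on a $0.01 grid

best_ask_raw = 2_243_500  # $224.35
best_bid_raw = 2_241_700  # $224.17

mid_raw = (best_ask_raw + best_bid_raw) / 2
spread_raw = best_ask_raw - best_bid_raw

mid_price = mid_raw / PRICE_SCALE            # 224.26 dollars
spread_dollars = spread_raw / PRICE_SCALE    # 0.18 dollars
spread_cents = spread_raw // UNITS_PER_CENT  # 18 cents

print(mid_price, spread_dollars, spread_cents)
```

The raw spread of 1,800 units is what the notebook reports as "1800 ticks"; in cents it is 18.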

In [5]:
book['MidPrice'] = (book['AskPrice1'] + book['BidPrice1']) / 2
book['Spread'] = book['AskPrice1'] - book['BidPrice1']

book_s = book[(book['Time'] >= SESSION_START) & (book['Time'] <= SESSION_END)].copy()
book_s = book_s.dropna(subset=['MidPrice', 'Spread'])
book_s = book_s[book_s['Spread'] > 0]

sample = book_s.iloc[::50]

plt.figure(figsize=(8, 4.5))
plt.plot(
    sample['Timestamp'],
    sample['MidPrice'],
    color=FL_BLUE,
    linewidth=1.2,
    label='Mid-price'
)
plt.ylabel('Price ($)')
plt.title('AMZN mid-price on 2012-06-21')
plt.tick_params(axis='both', length=0)
plt.gca().yaxis.set_major_formatter(
    mticker.FuncFormatter(lambda y, _: f'${y:.2f}')
)
plt.tight_layout()
plt.show()

plt.figure(figsize=(8, 4.5))
plt.plot(
    sample['Timestamp'],
    sample['Spread'] * 10000,
    color=FL_AMBER,
    linewidth=0.9,
    alpha=0.8
)
plt.ylabel('Spread (ticks, 1 tick = $0.0001)')
plt.title('Bid ask spread in ticks')
plt.gca().xaxis.set_major_formatter(mdates.DateFormatter('%H:%M'))
plt.gca().xaxis.set_major_locator(mdates.HourLocator())
plt.tick_params(axis='both', length=0)
plt.tight_layout()
plt.show()

spread_summary = pd.DataFrame([
    {
        'Metric': 'Mid-price range',
        'Value': f'${book_s["MidPrice"].min():.2f} to ${book_s["MidPrice"].max():.2f}'
    },
    {
        'Metric': 'Median spread',
        'Value': f'{book_s["Spread"].median() * 10000:.1f} ticks (${book_s["Spread"].median():.4f})'
    },
    {
        'Metric': 'Mean spread',
        'Value': f'{book_s["Spread"].mean() * 10000:.1f} ticks (${book_s["Spread"].mean():.4f})'
    },
    {
        'Metric': 'Spread std',
        'Value': f'{book_s["Spread"].std() * 10000:.2f} ticks'
    }
])

display(spread_summary)
[Figure: AMZN mid-price on 2012-06-21]
[Figure: Bid ask spread in ticks]
   Metric           Value
0  Mid-price range  $220.52 to $226.03
1  Median spread    1300.0 ticks ($0.1300)
2  Mean spread      1308.6 ticks ($0.1309)
3  Spread std       567.88 ticks

Order book depth profile

Average cumulative volume available at each price level across the full trading session. Deeper levels accumulate more resting liquidity and the shape of this profile reflects how much volume market participants are willing to quote away from the best price.
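
Cumulative depth at level k is the running sum of the per-level averages up to k. A minimal sketch, using the first five per-level averages from this notebook's output table as inputs:

```python
import numpy as np

# Average resting volume at levels 1 to 5 (shares), copied from the output table.
bid_avg = np.array([245, 139, 148, 152, 166])
ask_avg = np.array([139, 106, 99, 113, 135])

# Running totals: element k is the volume available within the best k+1 levels.
cum_bid = np.cumsum(bid_avg)
cum_ask = np.cumsum(ask_avg)

print(cum_bid.tolist())  # [245, 384, 532, 684, 850]
print(cum_ask.tolist())  # [139, 245, 344, 457, 592]
```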

In [6]:
ask_sizes = [book_s[f'AskSize{lvl}'].mean() for lvl in range(1, N_LEVELS + 1)]
bid_sizes = [book_s[f'BidSize{lvl}'].mean() for lvl in range(1, N_LEVELS + 1)]

cum_ask = np.cumsum(ask_sizes)
cum_bid = np.cumsum(bid_sizes)
levels = np.arange(1, N_LEVELS + 1)

width = 0.35

plt.figure(figsize=(8, 4.5))
plt.bar(
    levels - width / 2,
    bid_sizes,
    width=width,
    color=FL_GREEN,
    alpha=0.85,
    label='Bid'
)
plt.bar(
    levels + width / 2,
    ask_sizes,
    width=width,
    color=FL_RED,
    alpha=0.85,
    label='Ask'
)
plt.xlabel('Level')
plt.ylabel('Average volume (shares)')
plt.title('Average volume per level for bid vs ask')
plt.xticks(levels)
plt.legend()
plt.tick_params(axis='both', length=0)
plt.tight_layout()
plt.show()

plt.figure(figsize=(8, 4.5))
plt.fill_between(levels, cum_bid, alpha=0.15, color=FL_GREEN)
plt.plot(
    levels,
    cum_bid,
    color=FL_GREEN,
    marker='o',
    markersize=5,
    linewidth=1.6,
    label='Bid cumulative'
)
plt.fill_between(levels, cum_ask, alpha=0.15, color=FL_RED)
plt.plot(
    levels,
    cum_ask,
    color=FL_RED,
    marker='o',
    markersize=5,
    linewidth=1.6,
    label='Ask cumulative'
)
plt.xlabel('Level')
plt.ylabel('Cumulative volume (shares)')
plt.title('Cumulative depth for bid vs ask across 10 levels')
plt.xticks(levels)
plt.legend()
plt.tick_params(axis='both', length=0)
plt.tight_layout()
plt.show()

level_volume_df = pd.DataFrame({
    'Level': levels,
    'Bid avg': bid_sizes,
    'Ask avg': ask_sizes
})

level_volume_df[['Bid avg', 'Ask avg']] = (
    level_volume_df[['Bid avg', 'Ask avg']].round(0).astype(int)
)

display(level_volume_df)
[Figure: Average volume per level for bid vs ask]
[Figure: Cumulative depth for bid vs ask across 10 levels]
   Level  Bid avg  Ask avg
0  1      245      139
1  2      139      106
2  3      148      99
3  4      152      113
4  5      166      135
5  6      186      150
6  7      205      169
7  8      213      183
8  9      218      190
9  10     212      194

Limit order book snapshot

A single LOB snapshot shows the resting volume available at each price level at one instant in time. The chart below renders the book state at 10:00:00 ET.

Each horizontal bar represents one price level. Green bars = bid side (buyers). Red bars = ask side (sellers). The gap between the innermost bid and ask bars is the bid-ask spread.
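
Since snapshots are recorded only when an event occurs, "the book at 10:00:00" has to be mapped to a row. The sketch below shows two reasonable choices on a made-up, sorted array of event times: the first row at or after the target, and the last row at or before it (the state actually in force at the target instant). The event times are hypothetical.

```python
import numpy as np

# Hypothetical event times in seconds after midnight, sorted ascending.
times = np.array([35_999.2, 35_999.9, 36_000.4, 36_001.0])
TARGET_SEC = 36_000

# First row whose time is >= the target.
first_at_or_after = int(np.searchsorted(times, TARGET_SEC, side="left"))

# Last row whose time is <= the target.
last_at_or_before = int(np.searchsorted(times, TARGET_SEC, side="right")) - 1

print(first_at_or_after, times[first_at_or_after])  # 2 36000.4
print(last_at_or_before, times[last_at_or_before])  # 1 35999.9
```

Both lookups run in logarithmic time because they rely on the times being sorted, which holds for an event-ordered message stream.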

In [7]:
TARGET_SEC = 36_000
snap_idx = book_s[book_s['Time'] >= TARGET_SEC].index[0]
snap = book.loc[snap_idx]

bid_prices = [snap[f'BidPrice{l}'] for l in range(1, N_LEVELS + 1)]
bid_sizes = [snap[f'BidSize{l}'] for l in range(1, N_LEVELS + 1)]
ask_prices = [snap[f'AskPrice{l}'] for l in range(1, N_LEVELS + 1)]
ask_sizes = [snap[f'AskSize{l}'] for l in range(1, N_LEVELS + 1)]

bid_data = [(p, s) for p, s in zip(bid_prices, bid_sizes) if pd.notna(p) and pd.notna(s) and s > 0]
ask_data = [(p, s) for p, s in zip(ask_prices, ask_sizes) if pd.notna(p) and pd.notna(s) and s > 0]

snap_time = snap['Timestamp'].strftime('%H:%M:%S') if hasattr(snap['Timestamp'], 'strftime') else '10:00:xx'

plt.figure(figsize=(8, 4.5))

if bid_data:
    bp, bs = zip(*bid_data)
    bid_height = min(0.005, (max(bp) - min(bp)) / (len(bp) * 1.5) if len(bp) > 1 else 0.005)
    plt.barh(
        bp,
        [-s for s in bs],
        left=0,
        color=FL_GREEN,
        alpha=0.8,
        height=bid_height,
        label='Bid'
    )

if ask_data:
    ap, as_ = zip(*ask_data)
    ask_height = min(0.005, (max(ap) - min(ap)) / (len(ap) * 1.5) if len(ap) > 1 else 0.005)
    plt.barh(
        ap,
        as_,
        left=0,
        color=FL_RED,
        alpha=0.8,
        height=ask_height,
        label='Ask'
    )

plt.axvline(0, color=FL_BORDER, linewidth=1)
plt.xlabel('Volume (shares) | left = bid, right = ask')
plt.ylabel('Price ($)')
plt.title(f'AMZN limit order book snapshot at {snap_time} ET (10 levels)')
plt.gca().yaxis.set_major_formatter(
    mticker.FuncFormatter(lambda y, _: f'${y:.2f}')
)
plt.gca().xaxis.set_major_formatter(
    mticker.FuncFormatter(lambda x, _: f'{abs(x):,.0f}')
)
plt.legend(loc='lower right')
plt.tick_params(axis='both', length=0)
plt.tight_layout()
plt.show()

if bid_data and ask_data:
    mid = (bid_data[0][0] + ask_data[0][0]) / 2
    spread = ask_data[0][0] - bid_data[0][0]

    snapshot_df = pd.DataFrame([
        {
            'Metric': 'Snapshot time',
            'Value': f'{snap_time} ET'
        },
        {
            'Metric': 'Best bid',
            'Value': f'${bid_data[0][0]:.4f} ({bid_data[0][1]:,} shares)'
        },
        {
            'Metric': 'Best ask',
            'Value': f'${ask_data[0][0]:.4f} ({ask_data[0][1]:,} shares)'
        },
        {
            'Metric': 'Mid-price',
            'Value': f'${mid:.4f}'
        },
        {
            'Metric': 'Spread',
            'Value': f'${spread:.4f}'
        },
        {
            'Metric': 'Spread (ticks)',
            'Value': f'{spread * 10000:.0f}'
        }
    ])

    display(snapshot_df)
[Figure: AMZN limit order book snapshot at 10:00:00 ET (10 levels)]
   Metric          Value
0  Snapshot time   10:00:00 ET
1  Best bid        $224.1700 (200 shares)
2  Best ask        $224.3500 (100 shares)
3  Mid-price       $224.2600
4  Spread          $0.1800
5  Spread (ticks)  1800

Trade analysis

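Executions (event types 4 and 5) represent actual trades: an incoming marketable order hitting a resting limit order. The Direction field records the side of the resting limit order, not the aggressor, so the trade initiator is on the opposite side: Direction = -1 (a resting sell order on the ask side was executed) marks a buy-initiated trade, and Direction = 1 marks a sell-initiated trade.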
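
The mapping used in the summary code for this section (Direction = -1 counted as buy-initiated, Direction = 1 as sell-initiated) can be sketched on a few made-up execution rows. The column names Aggressor and SignedVolume are illustrative additions, not LOBSTER fields.

```python
import pandas as pd

# Hypothetical execution messages.
executions = pd.DataFrame({
    "EventType": [4, 4, 5, 4],
    "Size":      [100, 50, 20, 30],
    "Direction": [-1, 1, -1, 1],  # side of the resting order that was hit
})

# The aggressor is on the opposite side of the resting order.
executions["Aggressor"] = executions["Direction"].map({-1: "buy", 1: "sell"})

# Signed volume: positive for buy-initiated, negative for sell-initiated trades.
executions["SignedVolume"] = -executions["Direction"] * executions["Size"]

net_buy_volume = int(executions["SignedVolume"].sum())
print(executions)
print(net_buy_volume)  # 40
```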

In [8]:
trades = msg[msg['EventType'].isin([4, 5])].copy()
trades = trades[(trades['Time'] >= SESSION_START) & (trades['Time'] <= SESSION_END)]

trades['DollarValue'] = trades['Price'] * trades['Size']
trades['Bin'] = trades['Timestamp'].dt.floor('30min')

# VWAP per bin = traded dollar value / traded shares.
grouped = trades.groupby('Bin')
vwap = (grouped['DollarValue'].sum() / grouped['Size'].sum()).rename('VWAP')

vol_bin = grouped['Size'].sum().rename('Volume')

plt.figure(figsize=(8, 4.5))
plt.hist(
    trades['Size'].clip(upper=2000),
    bins=50,
    color=FL_BLUE,
    alpha=0.8,
    edgecolor='none'
)
plt.xlabel('Trade size (shares, clipped at 2,000)')
plt.ylabel('Frequency')
plt.title('Trade size distribution')
plt.tick_params(axis='both', length=0)
plt.tight_layout()
plt.show()

plt.figure(figsize=(8, 4.5))
plt.bar(
    vol_bin.index,
    vol_bin.values,
    width=pd.Timedelta('28min'),
    color=FL_SLATE,
    alpha=0.85
)
plt.ylabel('Total volume (shares)')
plt.title('Intraday volume in 30 minute buckets')
plt.gca().xaxis.set_major_formatter(mdates.DateFormatter('%H:%M'))
plt.gca().xaxis.set_major_locator(mdates.HourLocator(interval=2))
plt.gca().yaxis.set_major_formatter(
    mticker.FuncFormatter(lambda y, _: f'{y/1000:.0f}k' if y >= 1000 else f'{y:.0f}')
)
plt.tick_params(axis='both', length=0)
plt.tight_layout()
plt.show()

plt.figure(figsize=(8, 4.5))
plt.step(
    vwap.index,
    vwap.values,
    color=FL_AMBER,
    linewidth=1.6,
    where='post'
)
plt.ylabel('VWAP ($)')
plt.title('30 minute VWAP over the session')
plt.gca().xaxis.set_major_formatter(mdates.DateFormatter('%H:%M'))
plt.gca().xaxis.set_major_locator(mdates.HourLocator(interval=2))
plt.gca().yaxis.set_major_formatter(
    mticker.FuncFormatter(lambda y, _: f'${y:.2f}')
)
plt.tick_params(axis='both', length=0)
plt.tight_layout()
plt.show()

trade_summary = pd.DataFrame([
    {
        'Metric': 'Total trades',
        'Value': f'{len(trades):,}'
    },
    {
        'Metric': 'Total volume',
        'Value': f'{trades["Size"].sum():,} shares'
    },
    {
        'Metric': 'Total dollar volume',
        'Value': f'${trades["DollarValue"].sum():,.0f}'
    },
    {
        'Metric': 'Median trade size',
        'Value': f'{trades["Size"].median():.0f} shares'
    },
    {
        'Metric': 'Mean trade size',
        'Value': f'{trades["Size"].mean():.1f} shares'
    },
    {
        'Metric': 'Buy-initiated trades',
        'Value': f'{(trades["Direction"] == -1).sum():,} ({(trades["Direction"] == -1).mean():.1%})'
    },
    {
        'Metric': 'Sell-initiated trades',
        'Value': f'{(trades["Direction"] == 1).sum():,} ({(trades["Direction"] == 1).mean():.1%})'
    }
])

display(trade_summary)
[Figure: Trade size distribution]
[Figure: Intraday volume in 30 minute buckets]
[Figure: 30 minute VWAP over the session]
   Metric                 Value
0  Total trades           11,419
1  Total volume           810,755 shares
2  Total dollar volume    $180,502,249
3  Median trade size      66 shares
4  Mean trade size        71.0 shares
5  Buy-initiated trades   5,904 (51.7%)
6  Sell-initiated trades  5,515 (48.3%)

LOB evolution over a 10-minute window

This section tracks how the best bid, best ask, mid-price, and level-1 depth evolve over a representative 10-minute window (10:00 - 10:10 ET). It illustrates the continuous state changes driven by order submissions, cancellations, and executions.

In [9]:
WIN_START = 36_000
WIN_END = 36_600

win = book_s[(book_s['Time'] >= WIN_START) & (book_s['Time'] < WIN_END)].copy()

plt.figure(figsize=(8, 4.5))
plt.plot(
    win['Timestamp'],
    win['BidPrice1'],
    color=FL_GREEN,
    linewidth=1.0,
    alpha=0.8,
    label='Best bid'
)
plt.plot(
    win['Timestamp'],
    win['AskPrice1'],
    color=FL_RED,
    linewidth=1.0,
    alpha=0.8,
    label='Best ask'
)
plt.plot(
    win['Timestamp'],
    win['MidPrice'],
    color=FL_BLUE,
    linewidth=1.4,
    label='Mid-price'
)
plt.fill_between(
    win['Timestamp'],
    win['BidPrice1'],
    win['AskPrice1'],
    alpha=0.07,
    color=FL_SLATE
)
plt.ylabel('Price ($)')
plt.title('Best bid, best ask, and mid-price from 10:00 to 10:10 ET')
plt.gca().yaxis.set_major_formatter(
    mticker.FuncFormatter(lambda y, _: f'${y:.2f}')
)
plt.legend(loc='upper left', fontsize=9)
plt.tick_params(axis='both', length=0)
plt.tight_layout()
plt.show()

plt.figure(figsize=(8, 4.5))
plt.fill_between(win['Timestamp'], win['Spread'] * 10000, alpha=0.3, color=FL_AMBER)
plt.plot(win['Timestamp'], win['Spread'] * 10000, color=FL_AMBER, linewidth=1.0)
plt.ylabel('Spread (ticks)')
plt.title('Bid ask spread from 10:00 to 10:10 ET')
plt.tick_params(axis='both', length=0)
plt.tight_layout()
plt.show()

plt.figure(figsize=(8, 4.5))
plt.fill_between(win['Timestamp'], win['BidSize1'], alpha=0.2, color=FL_GREEN, label='Bid L1')
plt.fill_between(win['Timestamp'], win['AskSize1'], alpha=0.2, color=FL_RED, label='Ask L1')
plt.plot(win['Timestamp'], win['BidSize1'], color=FL_GREEN, linewidth=0.8)
plt.plot(win['Timestamp'], win['AskSize1'], color=FL_RED, linewidth=0.8)
plt.ylabel('Volume at L1 (shares)')
plt.title('Level 1 resting volume from 10:00 to 10:10 ET')
plt.gca().xaxis.set_major_formatter(mdates.DateFormatter('%H:%M:%S'))
plt.gca().xaxis.set_major_locator(mdates.MinuteLocator(interval=2))
plt.legend(loc='upper right', fontsize=9)
plt.tick_params(axis='both', length=0)
plt.tight_layout()
plt.show()
[Figure: Best bid, best ask, and mid-price from 10:00 to 10:10 ET]
[Figure: Bid ask spread from 10:00 to 10:10 ET]
[Figure: Level 1 resting volume from 10:00 to 10:10 ET]

Spread and mid-price return distribution

This section shows the distribution of the spread and of mid-price changes between consecutive events. Most events leave the mid-price unchanged because they do not alter the best bid or best ask; larger moves are rare and cluster in periods of elevated volatility.
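
A small sketch of how the unchanged, up, and down shares are obtained from a mid-price series. The series is made up and kept in raw integer units (dollars × 10,000); it assumes quotes on a $0.01 grid, so the mid-price moves in steps of 50 units ($0.005).

```python
import numpy as np

# Hypothetical mid-prices in raw units, one per consecutive book snapshot.
mid_raw = np.array([2_242_600, 2_242_600, 2_242_650, 2_242_650, 2_242_600, 2_242_600])

# Change between consecutive snapshots (one fewer element than the series).
changes = np.diff(mid_raw)

share_unchanged = float(np.mean(changes == 0))
share_up = float(np.mean(changes > 0))
share_down = float(np.mean(changes < 0))

print(changes.tolist())  # [0, 50, 0, -50, 0]
print(share_unchanged, share_up, share_down)
```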

In [10]:
spread_ticks = (book_s['Spread'] * 10000).round().astype(int)
spread_counts = spread_ticks.value_counts().sort_index().head(20)

mid_changes = book_s['MidPrice'].diff().dropna()
mid_changes_ticks = (mid_changes * 10000).round().astype(int)

plt.figure(figsize=(8, 4.5))
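# Spreads fall on multiples of 100 ticks ($0.01), so size the bars to that spacing;
# a width of 0.6 would render them as hairlines.
plt.bar(
    spread_counts.index,
    spread_counts.values,
    color=FL_BLUE,
    alpha=0.85,
    width=60
)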
plt.xlabel('Spread (ticks)')
plt.ylabel('Frequency')
plt.title('Bid ask spread distribution in ticks')
plt.tick_params(axis='both', length=0)
plt.tight_layout()
plt.show()

change_counts = mid_changes_ticks.value_counts().sort_index()
# Quotes sit on a $0.01 grid, so the mid-price moves in steps of $0.005 (50 ticks of $0.0001).
# A window of plus or minus 10 ticks would keep only the zero bar, so use 1,000 ticks ($0.10).
change_counts = change_counts[(change_counts.index >= -1000) & (change_counts.index <= 1000)]
colors = [FL_GREEN if i > 0 else (FL_RED if i < 0 else FL_SLATE) for i in change_counts.index]

plt.figure(figsize=(8, 4.5))
plt.bar(
    change_counts.index,
    change_counts.values,
    color=colors,
    alpha=0.85,
    width=30
)
plt.xlabel('Mid-price change (ticks)')
plt.ylabel('Frequency')
plt.title('Mid-price change per event within plus or minus 1,000 ticks')
plt.tick_params(axis='both', length=0)
plt.gca().yaxis.set_major_formatter(
    mticker.FuncFormatter(lambda y, _: f'{y/1000:.0f}k' if y >= 1000 else f'{y:.0f}')
)
plt.tight_layout()
plt.show()

zero_pct = (mid_changes_ticks == 0).mean()

micro_summary = pd.DataFrame([
    {
        'Metric': 'Mid-price unchanged',
        'Value': f'{zero_pct:.1%}'
    },
    {
        'Metric': 'Up moves',
        'Value': f'{(mid_changes_ticks > 0).mean():.1%}'
    },
    {
        'Metric': 'Down moves',
        'Value': f'{(mid_changes_ticks < 0).mean():.1%}'
    },
    {
        'Metric': 'Most common spread',
        'Value': f'{spread_counts.idxmax()} ticks ({spread_counts.max() / len(book_s):.1%} of snapshots)'
    }
])

display(micro_summary)
[Figure: Bid ask spread distribution in ticks]
[Figure: Mid-price change per event]
   Metric               Value
0  Mid-price unchanged  89.8%
1  Up moves             5.1%
2  Down moves           5.1%
3  Most common spread   1200 ticks (8.0% of snapshots)