
FI-2010 Benchmark Dataset - EDA

The FI-2010 dataset is the standard benchmark for research on limit order book (LOB) mid-price movement prediction. It covers ten trading days (June 1-14, 2010) of five Finnish stocks listed on the Helsinki Stock Exchange, part of the Nasdaq Nordic market, with the order book reconstructed to 10 levels of depth.

The dataset is structured as a supervised classification problem: given a 144-dimensional snapshot of the order book state, predict whether the mid-price will move up, down, or remain stationary over the next k events (k = 1, 2, 3, 5, 10).

Data source: Ntakaris et al. (2018), available via the authors upon request.
Original dataset: https://etsin.fairdata.fi/dataset/73eb48d7-4dbc-4a10-a52a-da745b47a649

This notebook uses a subset (fi2010_subset.npz): 2,000 rows sampled from each of the 9 cross-validation folds' test splits (18,000 rows total, ~10 MB). The full dataset contains over 300 thousand rows across train and test. All EDA code works identically on the full dataset; to use it, replace fi2010_subset.npz with the full .txt files via the loader in the methodology section.

Parameter       Value
Stocks          5 Finnish stocks, Helsinki Stock Exchange
Period          June 1–14, 2010 (10 trading days)
LOB depth       10 levels (bid and ask)
Features        144 (40 raw LOB + 104 time-series derived)
Label horizons  k = 1, 2, 3, 5, 10 events ahead
Label classes   1 = down, 2 = stationary, 3 = up
Normalisation   Z-score (NoAuction variant used here)
CV splits       9 anchored walk-forward folds
Subset rows     18,000 (2,000 per fold)

Related papers

  • Benchmark Dataset for Mid-Price Forecasting of Limit Order Book Data with Machine Learning Methods
    Ntakaris, A., Magris, M., Kanniainen, J., Gabbouj, M., and Iosifidis, A.
    Journal of Forecasting, 37(8), 852-866, 2018

  • DeepLOB: Deep Learning for Limit Order Books
    Zhang, Z., Zohren, S., and Roberts, S.
    IEEE Transactions on Signal Processing, 2019

  • Temporal Attention Augmented Bilinear Network for Financial Time-Series Data Analysis
    Tran, D.T., Iosifidis, A., Kanniainen, J., and Gabbouj, M.
    IEEE Transactions on Neural Networks and Learning Systems, 2019

In [1]:
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import matplotlib.ticker as mticker
import warnings
warnings.filterwarnings('ignore')

from pathlib import Path
from IPython.display import display, HTML

FL_BLUE   = '#2563eb'
FL_SLATE  = '#64748b'
FL_AMBER  = '#f59e0b'
FL_GREEN  = '#16a34a'
FL_RED    = '#ef4444'
FL_BG     = '#ffffff'
FL_GRID   = '#e2e8f0'
FL_TEXT   = '#0f172a'
FL_TEXT2  = '#334155'
FL_BORDER = '#e2e8f0'

matplotlib.rcParams.update({
    'figure.facecolor':  FL_BG,
    'axes.facecolor':    FL_BG,
    'axes.edgecolor':    FL_BORDER,
    'axes.labelcolor':   FL_TEXT2,
    'axes.spines.top':   False,
    'axes.spines.right': False,
    'axes.grid':         True,
    'grid.color':        FL_GRID,
    'grid.linewidth':    0.7,
    'xtick.color':       FL_TEXT2,
    'ytick.color':       FL_TEXT2,
    'xtick.labelsize':   10,
    'ytick.labelsize':   10,
    'axes.labelsize':    11,
    'axes.titlesize':    12,
    'axes.titlecolor':   FL_TEXT,
    'axes.titlepad':     12,
    'legend.frameon':    False,
    'legend.fontsize':   10,
    'figure.dpi':        300,
    'savefig.bbox':      'tight',
    'font.family':       'sans-serif',
    'font.sans-serif':   ['Inter', 'Helvetica Neue', 'Arial', 'DejaVu Sans'],
})

# To use the full dataset instead, replace this block with load_fi2010_txt()
# defined in the Methodology section at the bottom of this notebook.
SUBSET_PATH = Path('data/fi2010_subset.npz')
arc = np.load(SUBSET_PATH, allow_pickle=True)

X              = arc['X']               # (N, 144) float32
Y              = arc['Y']               # (N, 5)   int8  - columns = k=1,2,3,5,10
CF             = arc['cf']              # (N,)     int8  - fold index 1–9
HORIZON_LABELS = arc['horizon_labels']  # [1,2,3,5,10]
FEATURE_NAMES  = arc['feature_names']   # 144 strings

N, D = X.shape
print(f'Subset loaded:    {N:,} rows × {D} features')
#print(f'Label matrix:     {Y.shape}  (5 horizons)')
#print(f'Horizons:         {HORIZON_LABELS}')
#print(f'Folds present:    {np.unique(CF)}')
#print(f'Rows per fold:    {np.bincount(CF)[1:]}')

LABEL_MAP = {1: 'Down', 2: 'Stationary', 3: 'Up'}
LABEL_COLORS = {1: FL_RED, 2: FL_SLATE, 3: FL_GREEN}
Subset loaded:    18,000 rows × 144 features
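
Before any analysis, it can help to confirm that the loaded arrays match the shapes and value ranges described above. The helper below is a minimal sketch: `check_subset` is a hypothetical name, not part of the notebook, and it is exercised on synthetic arrays so that it runs without the data file.

```python
import numpy as np

def check_subset(X, Y, cf, n_features=144, n_horizons=5):
    """Validate shapes and label values; return a small integrity report."""
    X, Y, cf = np.asarray(X), np.asarray(Y), np.asarray(cf)
    if X.ndim != 2 or X.shape[1] != n_features:
        raise ValueError(f'expected (N, {n_features}) features, got {X.shape}')
    if Y.shape != (X.shape[0], n_horizons):
        raise ValueError(f'expected (N, {n_horizons}) labels, got {Y.shape}')
    if cf.shape != (X.shape[0],):
        raise ValueError(f'expected (N,) fold index, got {cf.shape}')
    unexpected = np.setdiff1d(np.unique(Y), [1, 2, 3])
    if unexpected.size:
        raise ValueError(f'unexpected label values: {unexpected}')
    folds, counts = np.unique(cf, return_counts=True)
    return {
        'rows': int(X.shape[0]),
        'all_finite': bool(np.isfinite(X).all()),
        'rows_per_fold': {int(f): int(c) for f, c in zip(folds, counts)},
    }

# Synthetic stand-in for the real arrays: 2 rows for each of 9 folds.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(18, 144)).astype(np.float32)
Y_demo = rng.integers(1, 4, size=(18, 5)).astype(np.int8)
cf_demo = np.repeat(np.arange(1, 10), 2).astype(np.int8)

report = check_subset(X_demo, Y_demo, cf_demo)
print(report)
```

With the real archive, the call would be `check_subset(X, Y, CF)`, and `rows_per_fold` should then report 2,000 rows for each of the 9 folds.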

Dataset structure

Each row in the feature matrix represents a single order book state snapshot. The 144 features are organised as:

  • Features 1-40: The raw LOB state - 10 bid prices, 10 ask prices, 10 bid sizes, 10 ask sizes (all z-score normalised)
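  • Features 41-144: 104 hand-crafted features derived from the raw LOB state. Following Ntakaris et al. (2018), these comprise time-insensitive features (spreads and mid-prices, price differences, mean prices and volumes, accumulated differences) and time-sensitive features (price and volume derivatives, event intensities, accelerations)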

The label matrix has 5 columns, one per prediction horizon. Each label is 1 (down), 2 (stationary), or 3 (up), representing the direction of the mid-price change over the next k order book events.
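
The notebook does not spell out the exact labelling rule, so the following is only an illustrative sketch of one scheme commonly described in the LOB literature: compare the mean of the next k mid-prices with the current mid-price and apply a threshold. The function name `label_mid_price_moves` and the threshold `alpha` are assumptions; the class codes follow the mapping used in this notebook (1 = down, 2 = stationary, 3 = up).

```python
import numpy as np

def label_mid_price_moves(mid, k, alpha=0.002):
    """Label each event by the relative change of the mean of the next k mid-prices.

    Returns an int8 array of length len(mid) - k with 1 = down, 2 = stationary, 3 = up.
    alpha is a hypothetical threshold, not a value taken from the dataset.
    """
    mid = np.asarray(mid, dtype=float)
    n = len(mid) - k
    if n <= 0:
        raise ValueError('need more than k mid-price observations')
    # csum[i] holds the sum of mid[:i], so a window sum is a difference of two entries.
    csum = np.concatenate([[0.0], np.cumsum(mid)])
    future_mean = (csum[k + 1:k + 1 + n] - csum[1:1 + n]) / k
    rel_change = (future_mean - mid[:n]) / mid[:n]
    labels = np.full(n, 2, dtype=np.int8)   # default: stationary
    labels[rel_change > alpha] = 3          # up
    labels[rel_change < -alpha] = 1         # down
    return labels

mid_demo = [100.0, 100.0, 101.0, 101.0, 100.0, 99.0]
labels_demo = label_mid_price_moves(mid_demo, k=2)
flat_demo = label_mid_price_moves([100.0] * 6, k=2)
print(labels_demo, flat_demo)
```

Under this rule a larger `alpha` enlarges the stationary class, so the class shares reported for any dataset depend on the threshold as well as on the horizon.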

In [2]:
raw_lob = FEATURE_NAMES[:40]
derived = FEATURE_NAMES[40:]

summary_rows = []
for j, h in enumerate(HORIZON_LABELS):
    vals, cnts = np.unique(Y[:, j], return_counts=True)
    d = {LABEL_MAP[int(v)]: f'{c:,} ({c / N:.1%})' for v, c in zip(vals, cnts)}
    summary_rows.append({
        'Horizon k': int(h),
        'Down (1)': d.get('Down', '-'),
        'Stationary (2)': d.get('Stationary', '-'),
        'Up (3)': d.get('Up', '-'),
    })

feature_summary_df = pd.DataFrame([
    {
        'Metric': 'Feature matrix',
        'Value': f'{N:,} rows x {D} columns'
    },
    {
        'Metric': 'Raw LOB features',
        'Value': f'features 1 to 40 ({len(raw_lob)} features)'
    },
    {
        'Metric': 'Derived features',
        'Value': f'features 41 to 144 ({len(derived)} features)'
    }
])

label_summary_df = pd.DataFrame(summary_rows)

print('Feature summary')
display(feature_summary_df)

print('Label distribution by horizon')
display(label_summary_df)
Feature summary
   Metric            Value
0  Feature matrix    18,000 rows x 144 columns
1  Raw LOB features  features 1 to 40 (40 features)
2  Derived features  features 41 to 144 (104 features)
Label distribution by horizon
   Horizon k       Down (1)  Stationary (2)         Up (3)
0          1  5,747 (31.9%)   6,392 (35.5%)  5,861 (32.6%)
1          2  6,473 (36.0%)   4,853 (27.0%)  6,674 (37.1%)
2          3  7,029 (39.1%)   3,803 (21.1%)  7,168 (39.8%)
3          5  7,593 (42.2%)   2,642 (14.7%)  7,765 (43.1%)
4         10  8,057 (44.8%)   1,624 (9.0%)   8,319 (46.2%)

Label distribution across horizons

At k = 1 the three classes are close to balanced in this subset: stationary is the largest class at 35.5%, with down and up at 31.9% and 32.6%. As the horizon grows, the mid-price has more events in which to move, so the stationary share falls steadily to 9.0% at k = 10, while down and up stay roughly equal to each other (44.8% and 46.2%). The scarcity of stationary labels at long horizons is a key challenge for LOB prediction models.
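
One common way to handle an uneven class distribution is to weight each class inversely to its frequency in the loss function. The sketch below applies that idea to the k = 10 counts from the label table above; `inverse_frequency_weights` is a hypothetical helper, and whether such weighting helps on this benchmark is not something this notebook tests.

```python
# Class counts at k = 10, copied from the label distribution table above.
counts_k10 = {1: 8057, 2: 1624, 3: 8319}

def inverse_frequency_weights(counts):
    """Weight each class by total / (n_classes * count), so rare classes weigh more."""
    total = sum(counts.values())
    n_classes = len(counts)
    return {cls: total / (n_classes * c) for cls, c in counts.items()}

weights = inverse_frequency_weights(counts_k10)
for cls, w in weights.items():
    print(f'class {cls}: weight {w:.3f}')
```

With these weights every class contributes the same total weight, so the weighted sample size still equals the original 18,000 rows.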

In [3]:
selected_horizons = [1, 3, 10]

for h in selected_horizons:
    h_idx = np.where(HORIZON_LABELS == h)[0][0]

    vals, cnts = np.unique(Y[:, h_idx], return_counts=True)
    labels = [LABEL_MAP[int(v)] for v in vals]
    colors = [LABEL_COLORS[int(v)] for v in vals]
    pcts = cnts / cnts.sum() * 100

    plt.figure(figsize=(8, 4.5))
    bars = plt.bar(labels, pcts, color=colors, alpha=0.85, width=0.5)

    for bar, pct in zip(bars, pcts):
        plt.text(
            bar.get_x() + bar.get_width() / 2,
            bar.get_height() + 0.5,
            f'{pct:.1f}%',
            ha='center',
            fontsize=9,
            color=FL_TEXT2
        )

    plt.title(f'Label distribution for k = {h}')
    plt.ylabel('Share (%)')
    plt.ylim(0, max(pcts) * 1.15)
    plt.tick_params(axis='both', length=0)
    plt.tight_layout()
    plt.show()
[Figure: Label distribution for k = 1]
[Figure: Label distribution for k = 3]
[Figure: Label distribution for k = 10]

Label distribution across CV folds (k=1)

The 9 cross-validation folds are anchored walk-forward splits: each successive fold tests on a later time window, and training data always precede test data. The k = 1 label distribution varies across folds, with the stationary share ranging from 29.2% (fold 4) to 49.2% (fold 2), which likely reflects different market conditions on different days.
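
The fold structure can be sketched as follows, assuming splits are made at the level of trading days: fold i trains on days 1 to i and tests on day i + 1, which gives 9 folds for 10 days. The function name is hypothetical, and the day-level split is an assumption about how the benchmark folds were built.

```python
def anchored_walk_forward(n_days=10):
    """Return (train_days, test_day) pairs; the training window always starts at day 1."""
    folds = []
    for test_day in range(2, n_days + 1):
        train_days = list(range(1, test_day))
        folds.append((train_days, test_day))
    return folds

folds_demo = anchored_walk_forward(10)
for i, (train_days, test_day) in enumerate(folds_demo, start=1):
    print(f'fold {i}: train days {train_days[0]}-{train_days[-1]}, test day {test_day}')
```

Because the training window is anchored at day 1 and only grows, later folds train on more data than earlier ones, which is worth keeping in mind when comparing per-fold scores.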

In [4]:
h_idx = 0  # column 0 of Y holds the k = 1 labels
folds = np.unique(CF)

fold_dist = []
for cf in folds:
    mask = CF == cf
    y_cf = Y[mask, h_idx]
    vals, cnts = np.unique(y_cf, return_counts=True)

    row = {
        'Fold': int(cf),
        'N': int(mask.sum())
    }

    for v in [1, 2, 3]:
        idx = np.where(vals == v)[0]
        row[LABEL_MAP[v]] = f'{cnts[idx[0]] / mask.sum():.1%}' if len(idx) else '0.0%'

    fold_dist.append(row)

fold_dist_df = pd.DataFrame(fold_dist)

print('Label distribution per fold for k = 1')
display(fold_dist_df)

plt.figure(figsize=(8, 4.5))
bottoms = np.zeros(len(folds))

for cls, color in [(1, FL_RED), (2, FL_SLATE), (3, FL_GREEN)]:
    pcts = []
    for cf in folds:
        mask = CF == cf
        pcts.append((Y[mask, h_idx] == cls).mean() * 100)

    plt.bar(
        folds,
        pcts,
        bottom=bottoms,
        color=color,
        alpha=0.85,
        label=LABEL_MAP[cls],
        width=0.6
    )
    bottoms += np.array(pcts)

plt.xlabel('CV fold')
plt.ylabel('Share (%)')
plt.title('Label distribution per fold for k = 1')
plt.xticks(folds)
plt.legend(loc='upper right')
plt.tick_params(axis='both', length=0)
plt.tight_layout()
plt.show()
Label distribution per fold for k = 1
   Fold     N   Down  Stationary     Up
0     1  2000  31.4%       36.4%  32.2%
1     2  2000  23.6%       49.2%  27.2%
2     3  2000  27.8%       37.1%  35.1%
3     4  2000  35.9%       29.2%  34.9%
4     5  2000  35.6%       31.7%  32.6%
5     6  2000  34.6%       30.6%  34.7%
6     7  2000  34.3%       35.5%  30.2%
7     8  2000  31.5%       32.5%  36.0%
8     9  2000  32.6%       37.2%  30.1%
[Figure: Label distribution per fold for k = 1 (stacked bars, folds 1 to 9)]

Raw LOB feature statistics

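The first 40 features are the z-score normalised order book state: 10 bid prices, 10 ask prices, 10 bid sizes, 10 ask sizes. A perfectly z-scored column would have mean 0 and standard deviation 1, but the subset does not: the first five columns summarised below have means between about -0.50 and 0.39 and standard deviations between 0.16 and 0.38. This is expected if the normalisation statistics were computed on a different sample than these 18,000 test-split rows. The means also alternate in sign from one column to the next, which suggests that prices and sizes may be interleaved rather than stored in four contiguous blocks of ten, so the column order assumed by feature_groups should be checked against FEATURE_NAMES before the per-group labels are trusted. Size features typically show heavier tails than prices because of occasional large orders.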
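
A z-scored column only has mean 0 and standard deviation 1 on the sample used to estimate the normalisation statistics. The sketch below illustrates this with synthetic data: statistics are fitted on one sample and applied to a second sample drawn from a shifted distribution. The helper names and the numbers are illustrative assumptions, not the benchmark's actual normalisation procedure.

```python
import numpy as np

def fit_zscore(train):
    """Estimate per-column mean and standard deviation on the training sample."""
    mean = train.mean(axis=0)
    std = train.std(axis=0)
    std = np.where(std == 0, 1.0, std)  # avoid division by zero for constant columns
    return mean, std

def apply_zscore(x, mean, std):
    """Normalise x with statistics estimated elsewhere."""
    return (x - mean) / std

rng = np.random.default_rng(0)
train = rng.normal(loc=100.0, scale=2.0, size=(5000, 3))
held_out = rng.normal(loc=101.0, scale=1.0, size=(1000, 3))  # shifted, less volatile

mean, std = fit_zscore(train)
z_train = apply_zscore(train, mean, std)
z_held_out = apply_zscore(held_out, mean, std)

print('train    mean/std:', z_train.mean(axis=0).round(3), z_train.std(axis=0).round(3))
print('held-out mean/std:', z_held_out.mean(axis=0).round(3), z_held_out.std(axis=0).round(3))
```

The held-out sample ends up with a non-zero mean and a standard deviation well below 1, the same kind of departure from the ideal that appears in the raw feature summary for this subset.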

In [5]:
X_raw = X[:, :40]
feat_mean = X_raw.mean(axis=0)
feat_std = X_raw.std(axis=0)

feature_groups = (
    [f'BidP{i}' for i in range(1, 11)] +
    [f'AskP{i}' for i in range(1, 11)] +
    [f'BidS{i}' for i in range(1, 11)] +
    [f'AskS{i}' for i in range(1, 11)]
)

group_colors = (
    [FL_GREEN] * 10 +
    [FL_RED] * 10 +
    [FL_BLUE] * 10 +
    [FL_AMBER] * 10
)

import matplotlib.patches as mpatches

legend = [
    mpatches.Patch(color=FL_GREEN, label='Bid prices (1 to 10)'),
    mpatches.Patch(color=FL_RED, label='Ask prices (1 to 10)'),
    mpatches.Patch(color=FL_BLUE, label='Bid sizes (1 to 10)'),
    mpatches.Patch(color=FL_AMBER, label='Ask sizes (1 to 10)'),
]

plt.figure(figsize=(8, 4.5))
plt.bar(range(40), feat_mean, color=group_colors, alpha=0.8, width=0.7)
plt.axhline(0, color=FL_GRID, linewidth=0.8)
plt.ylabel('Mean (z-score)')
plt.title('Raw LOB features: mean value after z-score normalization')
plt.xticks([])
plt.legend(handles=legend, fontsize=9, loc='upper right')
plt.tick_params(length=0)
plt.tight_layout()
plt.show()

plt.figure(figsize=(8, 4.5))
plt.bar(range(40), feat_std, color=group_colors, alpha=0.8, width=0.7)
plt.ylabel('Std (z-score)')
plt.title('Raw LOB features: standard deviation')
plt.xticks(
    range(0, 40, 5),
    [feature_groups[i] for i in range(0, 40, 5)],
    rotation=30,
    ha='right'
)
plt.tick_params(length=0)
plt.tight_layout()
plt.show()

raw_feature_stats_df = pd.DataFrame({
    'Feature': feature_groups,
    'Mean': feat_mean,
    'Std': feat_std
})

raw_feature_stats_df[['Mean', 'Std']] = raw_feature_stats_df[['Mean', 'Std']].round(4)

print('Raw LOB feature summary')
display(raw_feature_stats_df.head())
[Figure: Raw LOB features: mean value after z-score normalization]
[Figure: Raw LOB features: standard deviation]
Raw LOB feature summary
   Feature     Mean     Std
0    BidP1   0.3906  0.1625
1    BidP2  -0.4987  0.3760
2    BidP3   0.3891  0.1623
3    BidP4  -0.4689  0.2286
4    BidP5   0.3912  0.1625
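
The alternating means in this table would be consistent with a level-by-level interleaved layout, in which each book level stores ask price, ask size, bid price, bid size in turn. That ordering is an assumption here, not something the archive confirms, so the sketch below only builds the index arrays such a layout would imply; `interleaved_lob_indices` is a hypothetical helper.

```python
import numpy as np

def interleaved_lob_indices(levels=10):
    """Column indices per group, assuming [ask price, ask size, bid price, bid size] per level."""
    base = 4 * np.arange(levels)
    return {
        'ask_price': base,
        'ask_size': base + 1,
        'bid_price': base + 2,
        'bid_size': base + 3,
    }

lob_idx = interleaved_lob_indices()
for name, cols in lob_idx.items():
    print(f'{name:9s} -> columns {cols.tolist()}')
```

Printing `FEATURE_NAMES[:8]` from the archive and comparing it with this ordering is a quick way to decide whether the block slices (`X[:, :10]`, `X[:, 20:30]`, and so on) or index arrays like these are the right way to select each group.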

Feature correlation structure

Correlation matrix of the 40 raw LOB features. Bid and ask price levels are highly correlated with each other (the order book is a contiguous price ladder). Size features show lower correlation - volume at each level varies more independently.
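
A heatmap shows the block structure visually; averaging the correlations within and between feature groups turns the same comparison into a few numbers. The sketch below is one way to do that, shown on synthetic data in which one group shares a common factor and the other is independent noise. `block_mean_corr` and the group definitions are illustrative assumptions.

```python
import numpy as np

def block_mean_corr(corr, blocks):
    """Mean correlation for every pair of feature groups (diagonal entries excluded)."""
    names = list(blocks)
    out = np.zeros((len(names), len(names)))
    for a, name_a in enumerate(names):
        for b, name_b in enumerate(names):
            sub = corr[np.ix_(blocks[name_a], blocks[name_b])]
            if a == b:
                # Drop each feature's correlation with itself (always 1).
                off_diag = ~np.eye(len(blocks[name_a]), dtype=bool)
                out[a, b] = sub[off_diag].mean()
            else:
                out[a, b] = sub.mean()
    return names, out

rng = np.random.default_rng(0)
factor = rng.normal(size=(2000, 1))
group_a = factor + 0.1 * rng.normal(size=(2000, 3))  # strongly co-moving columns
group_b = rng.normal(size=(2000, 3))                 # independent columns
demo_corr = np.corrcoef(np.hstack([group_a, group_b]).T)

names, block_corr = block_mean_corr(demo_corr, {'A': [0, 1, 2], 'B': [3, 4, 5]})
print(names)
print(block_corr.round(3))
```

Applied to the notebook's `corr` matrix, the `blocks` argument would map each of the four LOB groups to its column indices, once the column order has been confirmed.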

In [6]:
corr = np.corrcoef(X[:, :40].T)

plt.figure(figsize=(8, 6))
im = plt.imshow(corr, cmap='RdYlGn', vmin=-1, vmax=1, aspect='auto')
plt.colorbar(im, fraction=0.03, pad=0.02)

for boundary in [10, 20, 30]:
    plt.axhline(boundary - 0.5, color='white', linewidth=1.5)
    plt.axvline(boundary - 0.5, color='white', linewidth=1.5)

plt.xticks(
    [5, 15, 25, 35],
    ['Bid prices', 'Ask prices', 'Bid sizes', 'Ask sizes'],
    fontsize=10
)
plt.yticks(
    [5, 15, 25, 35],
    ['Bid prices', 'Ask prices', 'Bid sizes', 'Ask sizes'],
    fontsize=10
)
plt.title('Raw LOB feature correlation matrix (40 features)')
plt.tick_params(length=0)
plt.tight_layout()
plt.show()
[Figure: Raw LOB feature correlation matrix (40 features)]

Mid-price proxy series

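The dataset does not include a price column in original units; all prices are z-score normalised. Feature 1 (BidPrice1, the best bid) and Feature 11 (AskPrice1, the best ask) can be averaged to form a normalised mid-price proxy. Because each column is normalised with its own mean and standard deviation, the proxy and the spread derived from it are in z-score units, and the spread proxy can be negative even though a true bid-ask spread cannot. Plotting the proxy across a single fold, coloured by label, shows how the k = 1 labels relate to local moves in the series.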
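
The effect of normalising bid and ask columns separately can be seen in a small worked example. The means, standard deviations, and z-scores below are made up for illustration and are not taken from FI-2010; `denormalise` is a hypothetical helper that inverts a z-score.

```python
import numpy as np

def denormalise(z, mean, std):
    """Invert a z-score: recover values in original units."""
    return np.asarray(z, dtype=float) * std + mean

# Hypothetical normalisation statistics for the best bid and best ask (price units).
bid_mean, bid_std = 25.00, 0.50
ask_mean, ask_std = 25.02, 0.50

z_bid = np.array([0.10, 0.20, 0.15])
z_ask = np.array([0.08, 0.22, 0.15])

bid = denormalise(z_bid, bid_mean, bid_std)
ask = denormalise(z_ask, ask_mean, ask_std)

spread_z = z_ask - z_bid   # spread proxy in z-score units
spread_px = ask - bid      # spread in price units

print('z-score spread proxy:', spread_z.round(3))
print('price-unit spread:   ', spread_px.round(3))
```

The first observation has a negative spread proxy in z-score units even though the ask is above the bid in price units, which is why the proxy should be read as a relative indicator rather than a tradable spread.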

In [7]:
fold1_mask = CF == 1
bid1 = X[fold1_mask, 0]
ask1 = X[fold1_mask, 10]
mid = (bid1 + ask1) / 2
spread = ask1 - bid1

n_fold = fold1_mask.sum()
idx = np.arange(n_fold)

plt.figure(figsize=(8, 4.5))
plt.plot(idx, mid, color=FL_BLUE, linewidth=0.8, alpha=0.9)
plt.ylabel('Normalized mid-price')
plt.title('Fold 1 normalized mid-price proxy (BidP1 + AskP1) / 2')
plt.tick_params(length=0)
plt.tight_layout()
plt.show()

plt.figure(figsize=(8, 4.5))
plt.fill_between(idx, spread, alpha=0.3, color=FL_AMBER)
plt.plot(idx, spread, color=FL_AMBER, linewidth=0.6)
plt.ylabel('Normalized spread')
plt.title('Bid-ask spread proxy (AskP1 - BidP1, z-score units)')
plt.tick_params(length=0)
plt.tight_layout()
plt.show()

y_fold1 = Y[fold1_mask, 0]

plt.figure(figsize=(8, 4.5))
for cls, color in [(1, FL_RED), (3, FL_GREEN)]:
    mask_cls = y_fold1 == cls
    plt.scatter(
        idx[mask_cls],
        mid[mask_cls],
        c=color,
        s=1,
        alpha=0.4,
        label=LABEL_MAP[cls]
    )

plt.scatter(
    idx[y_fold1 == 2],
    mid[y_fold1 == 2],
    c=FL_SLATE,
    s=0.5,
    alpha=0.2,
    label='Stationary'
)

plt.ylabel('Normalized mid-price')
plt.xlabel('Event index (fold 1)')
plt.title('Mid-price colored by k = 1 label')
plt.legend(markerscale=6, fontsize=9, loc='upper right')
plt.tick_params(length=0)
plt.tight_layout()
plt.show()
[Figure: Fold 1 normalized mid-price proxy (BidP1 + AskP1) / 2]
[Figure: Bid-ask spread proxy (AskP1 - BidP1, z-score units)]
[Figure: Mid-price colored by k = 1 label]

Order book depth profile

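Average z-score normalised size at each of the 10 bid and ask levels, taking columns 21-30 as bid sizes and columns 31-40 as ask sizes. Because the values are z-scores, they show each level's size relative to its own normalisation mean rather than absolute resting volume, so the chart cannot confirm the build-up of volume at deeper levels seen in the LOBSTER data examined in the AMZN EDA. In the table below the means alternate between roughly 0.39 at odd levels and -0.34 to -0.63 at even levels. This matches the alternation visible in the raw feature statistics table and suggests the assumed column order should be checked rather than read as a genuine depth pattern.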

In [8]:
bid_size_feats = X[:, 20:30].mean(axis=0)
ask_size_feats = X[:, 30:40].mean(axis=0)
levels = np.arange(1, 11)

width = 0.35

plt.figure(figsize=(8, 4.5))
plt.bar(
    levels - width / 2,
    bid_size_feats,
    width=width,
    color=FL_GREEN,
    alpha=0.85,
    label='Bid size'
)
plt.bar(
    levels + width / 2,
    ask_size_feats,
    width=width,
    color=FL_RED,
    alpha=0.85,
    label='Ask size'
)
plt.xlabel('Level')
plt.ylabel('Mean z-score normalized size')
plt.title('Average order book depth per level')
plt.xticks(levels)
plt.axhline(0, color=FL_GRID, linewidth=0.8)
plt.legend()
plt.tick_params(length=0)
plt.tight_layout()
plt.show()

depth_df = pd.DataFrame({
    'Level': levels,
    'Bid size mean': bid_size_feats,
    'Ask size mean': ask_size_feats
})

depth_df[['Bid size mean', 'Ask size mean']] = depth_df[['Bid size mean', 'Ask size mean']].round(4)

display(depth_df)
[Figure: Average order book depth per level]
   Level  Bid size mean  Ask size mean
0      1         0.3930         0.3848
1      2        -0.6110        -0.4346
2      3         0.3865         0.3947
3      4        -0.6324        -0.3377
4      5         0.3936         0.3838
5      6        -0.4682        -0.3653
6      7         0.3857         0.3955
7      8        -0.5509        -0.3405
8      9         0.3941         0.3828
9     10        -0.3735        -0.3402

Label entropy and class balance by horizon

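Shannon entropy of the label distribution measures class balance: it reaches its maximum of log2(3) ≈ 1.585 bits when the three classes are equally likely and falls as one or two classes dominate. Here entropy is highest at k = 1 (1.583 bits, almost uniform) and falls steadily to 1.347 bits at k = 10, because the stationary class shrinks to 9.0% and the task becomes close to a two-class up/down problem. A lower entropy means a majority-class baseline is harder to beat on accuracy (46.2% at k = 10 versus 35.5% at k = 1); it does not by itself say how predictable the labels are from the features.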
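
The entropy values can be reproduced directly from the label counts reported in the label distribution table earlier in this notebook. The sketch below uses plain NumPy instead of `scipy.stats.entropy` and prints each value next to the uniform maximum; `entropy_bits` is a hypothetical helper name.

```python
import numpy as np

# (down, stationary, up) counts per horizon, copied from the label distribution table.
label_counts = {
    1: (5747, 6392, 5861),
    2: (6473, 4853, 6674),
    3: (7029, 3803, 7168),
    5: (7593, 2642, 7765),
    10: (8057, 1624, 8319),
}

def entropy_bits(counts):
    """Shannon entropy in bits: H = -sum(p * log2(p)) over classes with p > 0."""
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

max_entropy = float(np.log2(3))
entropies_by_k = {k: entropy_bits(c) for k, c in label_counts.items()}
for k, h in entropies_by_k.items():
    print(f'k = {k:2d}: H = {h:.3f} bits ({h / max_entropy:.1%} of the uniform maximum)')
```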

In [9]:
from scipy.stats import entropy as shannon_entropy

rows = []
for j, h in enumerate(HORIZON_LABELS):
    vals, cnts = np.unique(Y[:, j], return_counts=True)
    probs = cnts / cnts.sum()
    ent = float(shannon_entropy(probs, base=2))
    majority_pct = probs.max()

    rows.append({
        'Horizon k': int(h),
        'Down %': f'{probs[vals == 1][0]:.1%}' if 1 in vals else '0%',
        'Stationary %': f'{probs[vals == 2][0]:.1%}' if 2 in vals else '0%',
        'Up %': f'{probs[vals == 3][0]:.1%}' if 3 in vals else '0%',
        'Majority class': f'{majority_pct:.1%}',
        'Shannon entropy': round(ent, 3),
    })

entropy_df = pd.DataFrame(rows)

print('Label balance summary by horizon')
display(entropy_df)

entropies = entropy_df['Shannon entropy'].tolist()

plt.figure(figsize=(8, 4.5))
plt.plot(HORIZON_LABELS, entropies, marker='o', color=FL_BLUE, linewidth=2)
plt.axhline(
    np.log2(3),
    color=FL_GRID,
    linewidth=1,
    linestyle='--',
    label='Max entropy (uniform)'
)
plt.xlabel('Prediction horizon k')
plt.ylabel('Shannon entropy (bits)')
plt.title('Label entropy vs horizon')
plt.xticks(HORIZON_LABELS)
plt.legend()
plt.tick_params(length=0)
plt.tight_layout()
plt.show()
Label balance summary by horizon
   Horizon k  Down %  Stationary %   Up %  Majority class  Shannon entropy
0          1   31.9%         35.5%  32.6%           35.5%            1.583
1          2   36.0%         27.0%  37.1%           37.1%            1.571
2          3   39.1%         21.1%  39.8%           39.8%            1.533
3          5   42.2%         14.7%  43.1%           43.1%            1.455
4         10   44.8%          9.0%  46.2%           46.2%            1.347
[Figure: Label entropy vs horizon]