Python Files and Data Objects Reference

This notebook is a practical reference for common Python file, folder, array, and data export operations.

It uses a small local project layout:

data/
outputs/

The data/ folder holds example input files, including the sample CSV. The outputs/ folder holds generated text files, JSON files, NumPy archives, pickle files, and Parquet files.

Install packages

The examples use the Python standard library, NumPy, pandas, and Parquet support through PyArrow.

In [1]:

# Run this once if the packages are missing.
# !pip install numpy pandas pyarrow

Imports and project folders

Use pathlib.Path for paths. It keeps file and folder operations readable and avoids manual string joining.

In [2]:

from pathlib import Path
import json
import pickle

import numpy as np
import pandas as pd

DATA_DIR = Path("data")
OUTPUT_DIR = Path("outputs")

DATA_DIR.mkdir(parents=True, exist_ok=True)
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

# print(DATA_DIR.resolve())
# print(OUTPUT_DIR.resolve())

Use pathlib for paths

Path creates a filesystem path object. Plain strings also work in many places, but Path is cleaner when joining folders, creating directories, reading files, and writing files.

In [3]:

base_dir = Path("outputs")
data_dir = base_dir / "data"
file_path = data_dir / "sample.csv"

print(base_dir)
print(data_dir)
print(file_path)

outputs
outputs/data
outputs/data/sample.csv

Key points:

The / operator joins path parts.
Path objects are accepted by open(), pandas, and NumPy, so they also work for JSON and pickle files opened with open().
Use strings for tiny examples, but prefer Path when a script manages files and folders.
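
As a small illustration of the points above, the sketch below joins parts with / and then takes the path apart again. The file name report.csv is made up for the example, and nothing is written to disk; name, stem, suffix, parent, and with_suffix are standard pathlib members.

```python
from pathlib import Path

# Hypothetical path used only for illustration; nothing is written to disk.
demo_path = Path("outputs") / "run_001" / "report.csv"

print(demo_path.name)    # final component, including the extension
print(demo_path.stem)    # final component without the extension
print(demo_path.suffix)  # extension, including the dot
print(demo_path.parent)  # everything except the final component

# with_suffix returns a new Path; the original object is unchanged.
parquet_twin = demo_path.with_suffix(".parquet")
print(parquet_twin.name)
```

This is handy when an output file should share a name with its input but use a different format.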

Create folders safely

mkdir creates a folder. parents=True also creates missing parent folders. exist_ok=True avoids an error if the folder already exists.

In [4]:

run_dir = OUTPUT_DIR / "run_001"
tables_dir = run_dir / "tables"
arrays_dir = run_dir / "arrays"
objects_dir = run_dir / "objects"
logs_dir = run_dir / "logs"

for folder in [tables_dir, arrays_dir, objects_dir, logs_dir]:
    folder.mkdir(parents=True, exist_ok=True)

print(tables_dir.exists())
print(arrays_dir.exists())
print(objects_dir.exists())
print(logs_dir.exists())

True
True
True
True

The equivalent Bash command, using brace expansion to create all four folders, is:

In [5]:

# mkdir -p outputs/run_001/{tables,arrays,objects,logs}
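
To see what the two flags protect against, the following sketch calls mkdir without them inside a temporary directory. The folder names run_demo and tables are placeholders chosen for this example.

```python
import tempfile
from pathlib import Path

# Work inside a throwaway directory so the project folders are untouched.
with tempfile.TemporaryDirectory() as tmp:
    nested = Path(tmp) / "run_demo" / "tables"

    # Without parents=True the missing "run_demo" parent causes an error.
    try:
        nested.mkdir()
        missing_parent_error = None
    except FileNotFoundError as exc:
        missing_parent_error = type(exc).__name__

    nested.mkdir(parents=True)

    # Without exist_ok=True a second call on an existing folder fails.
    try:
        nested.mkdir(parents=True)
        already_exists_error = None
    except FileExistsError as exc:
        already_exists_error = type(exc).__name__

    # With both flags the call is safe to repeat.
    nested.mkdir(parents=True, exist_ok=True)
    created = nested.is_dir()

print(missing_parent_error, already_exists_error, created)
```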

Write and read small text files

write_text writes a plain text file. If the file already exists, it is overwritten. This is useful for marker files, run names, small logs, and simple notes.

In [6]:

active_name = "run_001"
marker_path = OUTPUT_DIR / "current_run.txt"

marker_path.write_text(active_name)

loaded_name = marker_path.read_text().strip()

print(loaded_name)

run_001

Useful details:

write_text is best for small text files.
read_text loads the whole file as a string.
strip removes leading and trailing whitespace, including a trailing newline.
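
Because write_text overwrites, it is not suited to growing a log line by line. One possible approach, sketched here with a hypothetical demo.log file in a temporary directory, is to open the file in append mode instead.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    log_path = Path(tmp) / "demo.log"  # hypothetical file name

    # write_text replaces the file each time it is called.
    log_path.write_text("first\n", encoding="utf-8")
    log_path.write_text("second\n", encoding="utf-8")
    after_overwrite = log_path.read_text(encoding="utf-8")

    # Opening in "a" mode appends instead of overwriting.
    with log_path.open("a", encoding="utf-8") as f:
        f.write("third\n")

    log_lines = log_path.read_text(encoding="utf-8").splitlines()

print(repr(after_overwrite))
print(log_lines)
```

Passing encoding="utf-8" explicitly is a choice made for this sketch so the file reads the same on every platform.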

Create an example CSV input file

The later pandas examples need a real input file. This cell creates a small dataset under data/.

In [7]:

raw_df = pd.DataFrame({
    "date": pd.to_datetime(["2026-01-01", "2026-01-08", "2026-01-15", "2026-01-22"]),
    "actual": [100.0, 102.0, 101.0, 104.0],
    "predicted": [99.5, 101.7, 101.4, 103.2],
    "volume": [1200, 1350, 1280, 1420]
})

input_csv_path = DATA_DIR / "sample_predictions.csv"
raw_df.to_csv(input_csv_path, index=False)

print(input_csv_path)
print(raw_df)

data/sample_predictions.csv
        date  actual  predicted  volume
0 2026-01-01   100.0       99.5    1200
1 2026-01-08   102.0      101.7    1350
2 2026-01-15   101.0      101.4    1280
3 2026-01-22   104.0      103.2    1420

Read CSV input with pandas

pd.read_csv loads a CSV file into a DataFrame. Use parse_dates when a column should become a datetime column.

In [8]:

df = pd.read_csv(input_csv_path, parse_dates=["date"])

print(df.dtypes)
df

date         datetime64[us]
actual              float64
predicted           float64
volume                int64
dtype: object

Out[8]:

	date	actual	predicted	volume
0	2026-01-01	100.0	99.5	1200
1	2026-01-08	102.0	101.7	1350
2	2026-01-15	101.0	101.4	1280
3	2026-01-22	104.0	103.2	1420
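
For comparison, this sketch reads the same CSV text with and without parse_dates. The inline CSV string is made up and stands in for a file on disk; the check uses pandas.api.types.is_datetime64_any_dtype rather than a dtype name, since the exact text dtype can differ between pandas versions.

```python
import io

import pandas as pd

# Inline CSV text stands in for a file on disk (made-up values).
csv_text = "date,actual\n2026-01-01,100.0\n2026-01-08,102.0\n"

plain_df = pd.read_csv(io.StringIO(csv_text))
parsed_df = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])

is_datetime = pd.api.types.is_datetime64_any_dtype

print(is_datetime(plain_df["date"]))   # dates were left as text
print(is_datetime(parsed_df["date"]))  # dates became datetime values

# Datetime columns support the .dt accessor, for example the weekday name.
weekday_names = parsed_df["date"].dt.day_name().tolist()
print(weekday_names)
```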

Save JSON config files

json.dump writes a Python dictionary to a JSON file. The indent option only changes readability, not the data itself.

In [9]:

config = {
    "seed": 42,
    "device": "cpu",
    "name": "example_run",
    "window": 12,
    "input_file": str(input_csv_path),
    "output_dir": str(run_dir)
}

config_path = run_dir / "config.json"

with open(config_path, "w") as f:
    json.dump(config, f, indent=2)

with open(config_path) as f:
    loaded_config = json.load(f)

print(loaded_config)

{'seed': 42, 'device': 'cpu', 'name': 'example_run', 'window': 12, 'input_file': 'data/sample_predictions.csv', 'output_dir': 'outputs/run_001'}
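
The config above wraps its paths in str() because the json module cannot encode Path objects. The sketch below shows the error and two ways around it; demo_config is a made-up dictionary for this example.

```python
import json
from pathlib import Path

demo_config = {"input_file": Path("data") / "sample_predictions.csv"}

# json cannot encode Path objects directly.
try:
    json.dumps(demo_config)
    path_error = None
except TypeError as exc:
    path_error = type(exc).__name__

# Option 1: convert explicitly, as the config cell above does.
explicit = json.dumps({"input_file": str(demo_config["input_file"])})

# Option 2: let json call str() on anything it cannot encode itself.
# Note that default=str also stringifies other unsupported types silently.
fallback = json.dumps(demo_config, default=str)

print(path_error)
print(explicit == fallback)
```

Explicit conversion is usually easier to reason about; default=str is convenient but can hide values that were never meant to be saved as strings.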

Common JSON indent choices:

Option	Meaning
`indent=1`	One space per nesting level.
`indent=2`	Two spaces per nesting level; a common readable choice.
`indent=4`	Four spaces per nesting level; more spread out.
no indent	Single-line JSON (the default).
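
A quick way to confirm that indent affects only layout is to encode the same dictionary twice and compare the decoded results. The values in demo are made up.

```python
import json

demo = {"seed": 42, "tags": ["a", "b"]}  # made-up values

one_line = json.dumps(demo)          # no indent
pretty = json.dumps(demo, indent=2)  # two spaces per nesting level

print(one_line)
print(pretty)

# Both strings decode to the same data.
same_data = json.loads(one_line) == json.loads(pretty) == demo
print(same_data)
```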

Build arrays with hstack and vstack

hstack joins arrays side by side. vstack stacks rows on top of each other. This pattern is common when building feature matrices before training a model.

In [10]:

price_window = np.array([100, 101, 103])
count_window = np.array([20, 18, 22])
extra_features = np.array([0.4, -0.1])

row = np.hstack([price_window, count_window, extra_features])

row_2 = np.hstack([
    np.array([101, 103, 104]),
    np.array([18, 22, 25]),
    np.array([0.2, 0.3])
])

X = np.vstack([row, row_2])

print(row)
print(X)
print(X.shape)

[ 1.00e+02  1.01e+02  1.03e+02  2.00e+01  1.80e+01  2.20e+01  4.00e-01
 -1.00e-01]
[[ 1.00e+02  1.01e+02  1.03e+02  2.00e+01  1.80e+01  2.20e+01  4.00e-01
  -1.00e-01]
 [ 1.01e+02  1.03e+02  1.04e+02  1.80e+01  2.20e+01  2.50e+01  2.00e-01
   3.00e-01]]
(2, 8)

Key points:

hstack creates one flat feature row. Mixing integer and float inputs upcasts the result to float64.
vstack creates the full two-dimensional matrix.
Each row must have the same number of columns; otherwise vstack raises a ValueError.
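
The two behaviours worth checking before stacking are the result dtype and the row length. This sketch uses small made-up arrays to show both.

```python
import numpy as np

ints = np.array([1, 2, 3])
floats = np.array([0.5, 1.5])

# Mixing integer and float inputs produces a float result.
combined = np.hstack([ints, floats])
print(combined.dtype, combined.shape)

# vstack refuses rows of different lengths.
try:
    np.vstack([combined, np.array([1.0, 2.0])])
    stack_error = None
except ValueError as exc:
    stack_error = type(exc).__name__

print(stack_error)
```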

Use reshape for expected dimensions

reshape changes the array shape without changing the values or their order. A -1 tells NumPy to infer that dimension from the array size, so reshape(-1, 1) turns a flat array into a column vector.

In [11]:

a = np.array([10, 20, 30])

column = a.reshape(-1, 1)
# Named row_vector so it does not overwrite `row` from the hstack cell.
row_vector = a.reshape(1, -1)

print(a.shape)
print(column.shape)
print(row_vector.shape)

(3,)
(3, 1)
(1, 3)

Expression	Result shape	Use case
`a`	`(3,)`	Flat one-dimensional array.
`a.reshape(-1, 1)`	`(3, 1)`	Column vector.
`a.reshape(1, -1)`	`(1, 3)`	Single row matrix.
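
The sketch below shows how -1 is resolved for a six-element array, and what happens when the requested shape cannot fit the data. The array values is made up for the example.

```python
import numpy as np

values = np.arange(6)  # [0 1 2 3 4 5]

# -1 asks NumPy to work out that dimension from the total size.
two_rows = values.reshape(2, -1)
three_rows = values.reshape(3, -1)
print(two_rows.shape, three_rows.shape)

# The total number of elements must stay the same.
try:
    values.reshape(4, -1)
    reshape_error = None
except ValueError as exc:
    reshape_error = type(exc).__name__
print(reshape_error)

# For a contiguous array like this one, reshape returns a view,
# so the reshaped array shares memory with the original.
print(np.shares_memory(values, two_rows))
```

Because the result can be a view, writing into the reshaped array may also change the original; call copy() first if that matters.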

Save and load NPZ files

NPZ is a NumPy archive format. It stores multiple named arrays in one file. Compressed NPZ files are useful for cached feature matrices, labels, dates, and tensors.

In [12]:

y = np.array([10.0, 20.0])
dates = np.array(["2026-01-01", "2026-01-08"])

bundle = {
    "X": X,
    "y": y,
    "dates": dates
}

npz_path = arrays_dir / "bundle.npz"
np.savez_compressed(npz_path, **bundle)

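# np.load returns an NpzFile that holds the archive open;
# the with block closes it once the arrays have been read.
with np.load(npz_path, allow_pickle=False) as loaded:
    print(loaded["X"])
    print(loaded["y"])
    print(loaded["dates"])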

[[ 1.00e+02  1.01e+02  1.03e+02  2.00e+01  1.80e+01  2.20e+01  4.00e-01
  -1.00e-01]
 [ 1.01e+02  1.03e+02  1.04e+02  1.80e+01  2.20e+01  2.50e+01  2.00e-01
   3.00e-01]]
[10. 20.]
['2026-01-01' '2026-01-08']

Useful details:

Use NPZ for NumPy arrays.
The dictionary keys become array names inside the file.
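allow_pickle=False (the default) refuses to load object arrays, which could otherwise run arbitrary code from an untrusted file. Numeric and string arrays load without pickle.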
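
When an archive comes from an earlier run, it helps to list what it contains before indexing into it. This sketch, using made-up arrays and a temporary file, reads the stored names from the files attribute and closes the archive with a with block.

```python
import tempfile
from pathlib import Path

import numpy as np

demo_arrays = {
    "features": np.array([[1.0, 2.0], [3.0, 4.0]]),
    "labels": np.array([0, 1]),
}

with tempfile.TemporaryDirectory() as tmp:
    demo_npz = Path(tmp) / "demo.npz"
    np.savez_compressed(demo_npz, **demo_arrays)

    # NpzFile keeps the archive open; the with block closes it afterwards.
    with np.load(demo_npz, allow_pickle=False) as archive:
        stored_names = sorted(archive.files)
        restored = {name: archive[name] for name in archive.files}

print(stored_names)
for name, array in restored.items():
    print(name, array.dtype, array.shape)
```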

Save and load pickle files

Pickle stores Python objects such as dictionaries, lists, and nested structures. Only load pickle files from trusted sources.

In [13]:

folds = {
    "fold_0": {
        "train_idx": [0, 1, 2],
        "valid_idx": [3, 4]
    },
    "fold_1": {
        "train_idx": [0, 2, 4],
        "valid_idx": [1, 3]
    }
}

pickle_path = objects_dir / "folds.pkl"

with open(pickle_path, "wb") as f:
    pickle.dump(folds, f)

with open(pickle_path, "rb") as f:
    loaded_folds = pickle.load(f)

print(loaded_folds)

{'fold_0': {'train_idx': [0, 1, 2], 'valid_idx': [3, 4]}, 'fold_1': {'train_idx': [0, 2, 4], 'valid_idx': [1, 3]}}

Useful details:

Use wb for writing binary pickle files.
Use rb for reading binary pickle files.
Pickle is convenient for Python objects but not ideal for public data exchange.
Unpickling can execute arbitrary code, so never load pickle files from unknown sources.
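
One way to decide between pickle and JSON is to check which Python types need to survive the round trip. The sketch below uses a made-up object holding a tuple and a set.

```python
import json
import pickle

# Made-up object containing a tuple and a set.
demo_obj = {"shape": (2, 8), "seen": {1, 2, 3}}

# Pickle restores the same Python types.
from_pickle = pickle.loads(pickle.dumps(demo_obj))
print(from_pickle == demo_obj, type(from_pickle["shape"]).__name__)

# JSON has no tuple or set type: tuples become lists, and sets are rejected
# unless they are converted first.
as_json = json.dumps({"shape": demo_obj["shape"], "seen": sorted(demo_obj["seen"])})
from_json = json.loads(as_json)
print(type(from_json["shape"]).__name__, from_json["seen"])
```

For plain dictionaries of lists, strings, and numbers, such as the folds example above, JSON would also work and is safer to share.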

Save and load Parquet files

Parquet is a table format. Use it for pandas DataFrames with rows and columns. It preserves column types better than CSV and is usually more efficient for larger tables.

In [14]:

features_df = df.copy()

features_df["error"] = features_df["actual"] - features_df["predicted"]
features_df["abs_error"] = features_df["error"].abs()

parquet_path = tables_dir / "predictions.parquet"

features_df.to_parquet(
    parquet_path,
    index=False,
    engine="pyarrow"
)

loaded_df = pd.read_parquet(
    parquet_path,
    engine="pyarrow"
)

print(loaded_df.dtypes)
loaded_df

date         datetime64[us]
actual              float64
predicted           float64
volume                int64
error               float64
abs_error           float64
dtype: object

Out[14]:

	date	actual	predicted	volume	error	abs_error
0	2026-01-01	100.0	99.5	1200	0.5	0.5
1	2026-01-08	102.0	101.7	1350	0.3	0.3
2	2026-01-15	101.0	101.4	1280	-0.4	0.4
3	2026-01-22	104.0	103.2	1420	0.8	0.8

Use Parquet when the data is a structured table and should be reused by Python, Spark, DuckDB, BigQuery, or other analytics tools.
