Python Files and Data Objects Reference
This notebook is a practical reference for common Python file, folder, array, and data export operations.
It uses a small local project layout:
data/
outputs/
The data/ folder is used for example input files. The outputs/ folder is used for generated text files, JSON files, NumPy archives, pickle files, CSV files, and Parquet files.
Install packages
The examples use the Python standard library, NumPy, pandas, and Parquet support through PyArrow.
# Run this once if the packages are missing.
# !pip install numpy pandas pyarrow
Imports and project folders
Use pathlib.Path for paths. It keeps file and folder operations readable and avoids manual string joining.
from pathlib import Path
import json
import pickle
import numpy as np
import pandas as pd
DATA_DIR = Path("data")
OUTPUT_DIR = Path("outputs")
DATA_DIR.mkdir(parents=True, exist_ok=True)
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
# print(DATA_DIR.resolve())
# print(OUTPUT_DIR.resolve())
Use pathlib for paths
Path creates a filesystem path object. Plain strings also work in many places, but Path is cleaner when joining folders, creating directories, reading files, and writing files.
base_dir = Path("outputs")
data_dir = base_dir / "data"
file_path = data_dir / "sample.csv"
print(base_dir)
print(data_dir)
print(file_path)
outputs
outputs/data
outputs/data/sample.csv
Key points:
- The / operator joins path parts.
- Path objects work with pandas, NumPy, JSON, pickle, and normal file operations.
- Use strings for tiny examples, but prefer Path when a script manages files and folders.
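Besides joining, a path object exposes its pieces as attributes. The following is a minimal sketch, not part of the original notebook: it uses PurePosixPath so that the printed separators look the same on every operating system, and the path itself is only an illustration that does not need to exist on disk.

```python
from pathlib import PurePosixPath

# PurePosixPath never touches the filesystem and always prints "/" separators.
file_path = PurePosixPath("outputs") / "data" / "sample.csv"

print(file_path.name)    # sample.csv
print(file_path.stem)    # sample
print(file_path.suffix)  # .csv
print(file_path.parent)  # outputs/data
print(file_path.parts)   # ('outputs', 'data', 'sample.csv')

# with_suffix returns a new path; the original object is unchanged.
parquet_path = file_path.with_suffix(".parquet")
print(parquet_path)      # outputs/data/sample.parquet
```

The same attributes are available on a regular Path, which is what a script would normally use.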
Create folders safely
mkdir creates a folder. parents=True also creates missing parent folders. exist_ok=True avoids an error if the folder already exists.
run_dir = OUTPUT_DIR / "run_001"
tables_dir = run_dir / "tables"
arrays_dir = run_dir / "arrays"
objects_dir = run_dir / "objects"
logs_dir = run_dir / "logs"
for folder in [tables_dir, arrays_dir, objects_dir, logs_dir]:
    folder.mkdir(parents=True, exist_ok=True)
print(tables_dir.exists())
print(arrays_dir.exists())
print(objects_dir.exists())
print(logs_dir.exists())
True
True
True
True
The equivalent Bash command is:
# mkdir -p outputs/run_001/tables
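To see what the two mkdir flags protect against, the sketch below tries the calls without them. It runs inside a temporary directory so nothing is left behind, and the folder names run_demo and tables are hypothetical.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    nested = Path(tmp) / "run_demo" / "tables"  # hypothetical folder names

    # Without parents=True, the missing "run_demo" parent causes an error.
    try:
        nested.mkdir()
        missing_parent_error = False
    except FileNotFoundError:
        missing_parent_error = True

    nested.mkdir(parents=True, exist_ok=True)  # creates both levels
    nested.mkdir(parents=True, exist_ok=True)  # repeating the call is harmless

    # Without exist_ok=True, an existing folder raises FileExistsError.
    try:
        nested.mkdir()
        exists_error = False
    except FileExistsError:
        exists_error = True

    created = nested.is_dir()

print(missing_parent_error, exists_error, created)  # True True True
```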
Write and read small text files
write_text writes a plain text file. If the file already exists, it is overwritten. This is useful for marker files, run names, small logs, and simple notes.
active_name = "run_001"
marker_path = OUTPUT_DIR / "current_run.txt"
marker_path.write_text(active_name)
loaded_name = marker_path.read_text().strip()
print(loaded_name)
run_001
Useful details:
- write_text is best for small text files.
- read_text loads the whole file as a string.
- strip removes leading and trailing whitespace, including a trailing newline.
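Because write_text replaces the file each time, a log that should grow needs append mode instead. This is a small sketch under assumed names (run.log is hypothetical). It passes encoding="utf-8" explicitly, which is optional but avoids depending on the platform default.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    log_path = Path(tmp) / "run.log"  # hypothetical file name

    # write_text replaces the whole file on every call.
    log_path.write_text("first\n", encoding="utf-8")
    log_path.write_text("second\n", encoding="utf-8")
    after_overwrite = log_path.read_text(encoding="utf-8")

    # Append mode keeps the existing content and adds to the end.
    with log_path.open("a", encoding="utf-8") as f:
        f.write("third\n")

    lines = log_path.read_text(encoding="utf-8").splitlines()

print(repr(after_overwrite))  # 'second\n'
print(lines)                  # ['second', 'third']
```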
Create an example CSV input file
The later pandas examples need a real input file. This cell creates a small dataset under data/.
raw_df = pd.DataFrame({
    "date": pd.to_datetime(["2026-01-01", "2026-01-08", "2026-01-15", "2026-01-22"]),
    "actual": [100.0, 102.0, 101.0, 104.0],
    "predicted": [99.5, 101.7, 101.4, 103.2],
    "volume": [1200, 1350, 1280, 1420]
})
input_csv_path = DATA_DIR / "sample_predictions.csv"
raw_df.to_csv(input_csv_path, index=False)
print(input_csv_path)
print(raw_df)
data/sample_predictions.csv
        date  actual  predicted  volume
0 2026-01-01   100.0       99.5    1200
1 2026-01-08   102.0      101.7    1350
2 2026-01-15   101.0      101.4    1280
3 2026-01-22   104.0      103.2    1420
Read CSV input with pandas
pd.read_csv loads a CSV file into a DataFrame. Use parse_dates when a column should become a datetime column.
df = pd.read_csv(input_csv_path, parse_dates=["date"])
print(df.dtypes)
df
date         datetime64[us]
actual              float64
predicted           float64
volume                int64
dtype: object
| | date | actual | predicted | volume |
|---|---|---|---|---|
| 0 | 2026-01-01 | 100.0 | 99.5 | 1200 |
| 1 | 2026-01-08 | 102.0 | 101.7 | 1350 |
| 2 | 2026-01-15 | 101.0 | 101.4 | 1280 |
| 3 | 2026-01-22 | 104.0 | 103.2 | 1420 |
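The effect of parse_dates is easiest to see by reading the same CSV twice. The sketch below uses a tiny in-memory CSV with illustrative values in place of a file on disk.

```python
import io

import pandas as pd

# A small in-memory CSV stands in for a real file (values are illustrative).
csv_text = "date,actual\n2026-01-01,100.0\n2026-01-08,102.0\n"

plain = pd.read_csv(io.StringIO(csv_text))
parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])

is_dt = pd.api.types.is_datetime64_any_dtype
print(is_dt(plain["date"]))   # False: the column holds plain strings
print(is_dt(parsed["date"]))  # True: the column holds datetimes

# Datetime columns support the .dt accessor, for example weekday names.
day_names = parsed["date"].dt.day_name().tolist()
print(day_names)
```

Without parse_dates the column still loads, but date arithmetic and the .dt accessor are not available until it is converted.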
Save JSON config files
json.dump writes a Python dictionary to a JSON file. The indent option only changes readability, not the data itself.
config = {
    "seed": 42,
    "device": "cpu",
    "name": "example_run",
    "window": 12,
    "input_file": str(input_csv_path),
    "output_dir": str(run_dir)
}
config_path = run_dir / "config.json"
with open(config_path, "w") as f:
    json.dump(config, f, indent=2)
with open(config_path) as f:
    loaded_config = json.load(f)
print(loaded_config)
{'seed': 42, 'device': 'cpu', 'name': 'example_run', 'window': 12, 'input_file': 'data/sample_predictions.csv', 'output_dir': 'outputs/run_001'}
Common JSON indent choices:
| Option | Meaning |
|---|---|
| indent=1 | Pretty JSON with one space per level. |
| indent=2 | Common compact readable JSON. |
| indent=4 | More spaced readable JSON. |
| no indent | Compact one-line JSON. |
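As a quick check that indent affects only the layout, the sketch below serializes one dictionary both ways and decodes the results. The dictionary is a shortened, illustrative version of a config.

```python
import json

config = {"seed": 42, "device": "cpu", "window": 12}  # sample values

compact = json.dumps(config)           # one line
pretty = json.dumps(config, indent=2)  # one key per line

print(compact)
print(pretty)

# Both strings decode to the same dictionary.
same = json.loads(compact) == json.loads(pretty) == config
print(same)  # True
```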
Build arrays with hstack and vstack
hstack joins arrays side by side. vstack stacks rows on top of each other. This pattern is common when building feature matrices before training a model.
price_window = np.array([100, 101, 103])
count_window = np.array([20, 18, 22])
extra_features = np.array([0.4, -0.1])
row = np.hstack([price_window, count_window, extra_features])
row_2 = np.hstack([
    np.array([101, 103, 104]),
    np.array([18, 22, 25]),
    np.array([0.2, 0.3])
])
X = np.vstack([row, row_2])
print(row)
print(X)
print(X.shape)
[ 1.00e+02 1.01e+02 1.03e+02 2.00e+01 1.80e+01 2.20e+01 4.00e-01 -1.00e-01]
[[ 1.00e+02 1.01e+02 1.03e+02 2.00e+01 1.80e+01 2.20e+01 4.00e-01 -1.00e-01]
 [ 1.01e+02 1.03e+02 1.04e+02 1.80e+01 2.20e+01 2.50e+01 2.00e-01 3.00e-01]]
(2, 8)
Key points:
- hstack creates one flat feature row.
- vstack creates the full two-dimensional matrix.
- Each row should have the same number of columns before stacking.
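The requirement that every row has the same number of columns can be checked directly. This is a small sketch with made-up values: two rows of equal width stack cleanly, while a shorter row makes vstack raise ValueError.

```python
import numpy as np

row_a = np.hstack([np.array([1, 2, 3]), np.array([0.5])])  # 4 columns
row_b = np.hstack([np.array([4, 5, 6]), np.array([0.7])])  # 4 columns
row_short = np.array([7, 8, 9])                            # 3 columns

X_demo = np.vstack([row_a, row_b])
print(X_demo.shape)  # (2, 4)
print(X_demo.dtype)  # float64: the integers are upcast to match the floats

# Rows of different widths cannot be stacked.
try:
    np.vstack([row_a, row_short])
    mismatch_raised = False
except ValueError:
    mismatch_raised = True
print(mismatch_raised)  # True
```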
Use reshape for expected dimensions
reshape changes the array shape without changing the values. reshape(-1, 1) turns a flat array into a column vector.
a = np.array([10, 20, 30])
column = a.reshape(-1, 1)
row = a.reshape(1, -1)
print(a.shape)
print(column.shape)
print(row.shape)
(3,)
(3, 1)
(1, 3)
| Expression | Result shape | Use case |
|---|---|---|
| a | (3,) | Flat one-dimensional array. |
| a.reshape(-1, 1) | (3, 1) | Column vector. |
| a.reshape(1, -1) | (1, 3) | Single row matrix. |
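The -1 in a reshape call asks NumPy to infer that dimension from the array size. The sketch below, using an illustrative six-element array, shows the inference and what happens when the sizes do not divide evenly.

```python
import numpy as np

a = np.arange(6)         # [0 1 2 3 4 5]

grid = a.reshape(2, -1)  # -1 is inferred as 3
print(grid.shape)        # (2, 3)

flat = grid.reshape(-1)  # back to one dimension
print(flat.shape)        # (6,)

# 6 values cannot be split into rows of 4, so NumPy raises ValueError.
try:
    a.reshape(-1, 4)
    bad_shape_raised = False
except ValueError:
    bad_shape_raised = True
print(bad_shape_raised)  # True
```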
Save and load NPZ files
NPZ is a NumPy archive format. It stores multiple named arrays in one file. Compressed NPZ files are useful for cached feature matrices, labels, dates, and tensors.
y = np.array([10.0, 20.0])
dates = np.array(["2026-01-01", "2026-01-08"])
bundle = {
    "X": X,
    "y": y,
    "dates": dates
}
npz_path = arrays_dir / "bundle.npz"
np.savez_compressed(npz_path, **bundle)
loaded = np.load(npz_path, allow_pickle=False)
print(loaded["X"])
print(loaded["y"])
print(loaded["dates"])
[[ 1.00e+02 1.01e+02 1.03e+02 2.00e+01 1.80e+01 2.20e+01 4.00e-01 -1.00e-01]
 [ 1.01e+02 1.03e+02 1.04e+02 1.80e+01 2.20e+01 2.50e+01 2.00e-01 3.00e-01]]
[10. 20.]
['2026-01-01' '2026-01-08']
Useful details:
- Use NPZ for NumPy arrays.
- The dictionary keys become array names inside the file.
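- allow_pickle=False is safer when loading normal numeric arrays. It is also the default in current NumPy versions, so passing it explicitly mainly documents intent.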
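An NPZ archive can also be inspected before any array is read. The sketch below is one possible pattern, with a hypothetical file name and small placeholder arrays: it lists the stored names through the files attribute and uses a with-block so the archive is closed afterwards.

```python
import tempfile
from pathlib import Path

import numpy as np

with tempfile.TemporaryDirectory() as tmp:
    npz_path = Path(tmp) / "demo.npz"  # hypothetical file name
    np.savez_compressed(npz_path, X=np.eye(2), y=np.array([1.0, 2.0]))

    # np.load returns a lazy archive object; the with-block closes it.
    with np.load(npz_path, allow_pickle=False) as archive:
        names = sorted(archive.files)  # names of the stored arrays
        X_loaded = archive["X"]        # arrays are read on access
        y_loaded = archive["y"]

print(names)                              # ['X', 'y']
print(X_loaded.shape, y_loaded.tolist())  # (2, 2) [1.0, 2.0]
```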
Save and load pickle files
Pickle stores Python objects such as dictionaries, lists, and nested structures. Only load pickle files from trusted sources.
folds = {
    "fold_0": {
        "train_idx": [0, 1, 2],
        "valid_idx": [3, 4]
    },
    "fold_1": {
        "train_idx": [0, 2, 4],
        "valid_idx": [1, 3]
    }
}
pickle_path = objects_dir / "folds.pkl"
with open(pickle_path, "wb") as f:
    pickle.dump(folds, f)
with open(pickle_path, "rb") as f:
    loaded_folds = pickle.load(f)
print(loaded_folds)
{'fold_0': {'train_idx': [0, 1, 2], 'valid_idx': [3, 4]}, 'fold_1': {'train_idx': [0, 2, 4], 'valid_idx': [1, 3]}}
Useful details:
- Use wb for writing binary pickle files.
- Use rb for reading binary pickle files.
- Pickle is convenient for Python objects but not ideal for public data exchange.
- Do not load pickle files from unknown sources.
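For structures made only of dictionaries, lists, strings, and numbers, JSON is a possible alternative when the file needs to be shared. The sketch below compares the two in memory with a shortened, illustrative folds dictionary. It is a comparison for orientation, not a replacement for the file-based example.

```python
import json
import pickle

folds = {"fold_0": {"train_idx": [0, 1, 2], "valid_idx": [3, 4]}}

# Pickle round trip in memory: dumps returns bytes, loads rebuilds the object.
blob = pickle.dumps(folds, protocol=pickle.HIGHEST_PROTOCOL)
from_pickle = pickle.loads(blob)

# JSON round trip: the result is text that other languages can read.
text = json.dumps(folds)
from_json = json.loads(text)

print(type(blob).__name__, type(text).__name__)  # bytes str
print(from_pickle == folds, from_json == folds)  # True True

# Pickle keeps Python-only types such as tuples; JSON turns them into lists.
tuple_via_pickle = pickle.loads(pickle.dumps((1, 2)))
tuple_via_json = json.loads(json.dumps((1, 2)))
print(tuple_via_pickle)  # (1, 2)
print(tuple_via_json)    # [1, 2]
```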
Save and load Parquet files
Parquet is a table format. Use it for pandas DataFrames with rows and columns. It preserves column types better than CSV and is usually more efficient for larger tables.
features_df = df.copy()
features_df["error"] = features_df["actual"] - features_df["predicted"]
features_df["abs_error"] = features_df["error"].abs()
parquet_path = tables_dir / "predictions.parquet"
features_df.to_parquet(
    parquet_path,
    index=False,
    engine="pyarrow"
)
loaded_df = pd.read_parquet(
    parquet_path,
    engine="pyarrow"
)
print(loaded_df.dtypes)
loaded_df
date         datetime64[us]
actual              float64
predicted           float64
volume                int64
error               float64
abs_error           float64
dtype: object
| | date | actual | predicted | volume | error | abs_error |
|---|---|---|---|---|---|---|
| 0 | 2026-01-01 | 100.0 | 99.5 | 1200 | 0.5 | 0.5 |
| 1 | 2026-01-08 | 102.0 | 101.7 | 1350 | 0.3 | 0.3 |
| 2 | 2026-01-15 | 101.0 | 101.4 | 1280 | -0.4 | 0.4 |
| 3 | 2026-01-22 | 104.0 | 103.2 | 1420 | 0.8 | 0.8 |
Use Parquet when the data is a structured table and should be reused by Python, Spark, DuckDB, BigQuery, or other analytics tools.