Point-Biserial and Pearson Correlations for Automotive Freeze-Frame Data

Introduction

In modern vehicle diagnostics, freeze-frame data captures snapshots of sensor readings and monitors when a diagnostic trouble code (DTC) triggers. By analyzing these snapshots, we can spot trends that might explain why certain DTCs appear or how monitors reach a “complete” state. Below is a Python script that demonstrates how to:

Produce a Pearson correlation matrix for numeric parameters (like engine RPM, coolant temperature, etc.)
Conduct Point-Biserial correlation between numeric parameters and binary monitors (e.g., “Complete” vs. “NotComplete”)
Generate separate tabs for the top 10 most frequent DTCs, plus separate tabs for “MIL OFF” and “MIL ON” conditions, each with its own correlation results.

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import pointbiserialr
from openpyxl import load_workbook

# ------------------------- USER SETTINGS -------------------------
input_file = 'example input file.csv'            # CSV input file path
output_file = 'Correlation_Results_Top10DTC.xlsx' # Excel output file

dtc_col = "DTC that caused required freeze frame data storage ()"  # Column for DTC codes
mil_col = "MIL"  # Column for MIL status (values: "ON", "OFF", possibly others)

# Original list of possible monitoring columns
monitoring_columns = [
    "MisfireMonitoring",
    "EGR_VVTSystemMonitoring",
    "SecondaryAirSystemMonitoring",
    "NMHCCatalystMonitoring",
    "FuelSystemMonitoring",
    "CatalystMonitoring",
    "NOxAftertreatmentMonitoring",
    "PMFilterMonitoring",
    "EvaporativeSystemMonitoring",
    "BoostPressureSystemMonitoring",
    "HeatedCatalystMonitoring",
    "OxygenSensorHeaterMonitoring",
    "ExhaustGasSensorMonitoring",
    "ComprehensiveComponentMonitoring",
    "OxygenSensorMonitoring"
]

# ------------------------- READ DATA -------------------------
df = pd.read_csv(input_file)

# (1) Dynamically remove columns with 'NotSupported' > 90%
for col in list(monitoring_columns):
    if col not in df.columns:
        monitoring_columns.remove(col)
        continue

    freq = df[col].value_counts(dropna=False, normalize=True)
    if ('NotSupported' in freq.index) and (freq['NotSupported'] > 0.90):
        monitoring_columns.remove(col)
        print(f"Removed '{col}' because 'NotSupported' exceeds 90%.")

# (2) Create a binary version (1 for Complete, 0 for NotComplete; others => NaN)
df_mon = df[monitoring_columns].apply(lambda c: c.map({'Complete': 1, 'NotComplete': 0}))

# (3) Identify numeric freeze-frame columns
df_numeric = df.select_dtypes(include=[np.number])

# Combine numeric + binary monitor columns into one DataFrame for correlation
df_corr = pd.concat([df_numeric, df_mon], axis=1)

# ------------------ GLOBAL CORRELATION (entire dataset) ------------------
# A) Pearson correlation matrix
corr_matrix = df_corr.corr()

# B) Point-Biserial correlation (entire dataset)
global_results = []
for num_col in df_numeric.columns:
    for mon_col in monitoring_columns:
        if mon_col not in df_mon.columns:
            continue
        valid_data = df[[num_col]].join(df_mon[mon_col]).dropna()
        if len(valid_data) < 2:
            continue
        if valid_data[num_col].nunique() < 2 or valid_data[mon_col].nunique() < 2:
            continue

        r, p = pointbiserialr(valid_data[num_col], valid_data[mon_col])
        global_results.append({
            "Numeric Column": num_col,
            "Monitor": mon_col,
            "Point-Biserial r": r,
            "p-value": p
        })

results_df = pd.DataFrame(global_results)

# ------------------ PER-DTC CORRELATION (TOP 10) ------------------
top_dtc_values = df[dtc_col].value_counts(dropna=True).head(10).index

dtc_sheets = {}
for dtc_val in top_dtc_values:
    dtc_subset = df[df[dtc_col] == dtc_val]
    dtc_mon_subset = df_mon.loc[dtc_subset.index]
    dtc_num_subset = df_numeric.loc[dtc_subset.index]

    dtc_results = []
    for num_col in dtc_num_subset.columns:
        for mon_col in monitoring_columns:
            if mon_col not in dtc_mon_subset.columns:
                continue
            valid_data = dtc_subset[[num_col]].join(dtc_mon_subset[mon_col]).dropna()
            if len(valid_data) < 2:
                continue
            if valid_data[num_col].nunique() < 2 or valid_data[mon_col].nunique() < 2:
                continue

            r, p = pointbiserialr(valid_data[num_col], valid_data[mon_col])
            dtc_results.append({
                "Numeric Column": num_col,
                "Monitor": mon_col,
                "Point-Biserial r": r,
                "p-value": p
            })
    dtc_sheets[dtc_val] = pd.DataFrame(dtc_results)

# ------------------ MIL=OFF & MIL=ON CORRELATION ------------------
mil_off_df = pd.DataFrame()
mil_on_df = pd.DataFrame()

if mil_col in df.columns:
    # Subset to MIL == "OFF"
    df_mil_off = df[df[mil_col] == "OFF"]
    if not df_mil_off.empty:
        mil_off_mon = df_mon.loc[df_mil_off.index]
        mil_off_num = df_numeric.loc[df_mil_off.index]

        mil_off_results = []
        for num_col in mil_off_num.columns:
            for mon_col in monitoring_columns:
                if mon_col not in mil_off_mon.columns:
                    continue
                valid_data = df_mil_off[[num_col]].join(mil_off_mon[mon_col]).dropna()
                if len(valid_data) < 2:
                    continue
                if valid_data[num_col].nunique() < 2 or valid_data[mon_col].nunique() < 2:
                    continue

                r, p = pointbiserialr(valid_data[num_col], valid_data[mon_col])
                mil_off_results.append({
                    "Numeric Column": num_col,
                    "Monitor": mon_col,
                    "Point-Biserial r": r,
                    "p-value": p
                })
        mil_off_df = pd.DataFrame(mil_off_results)

    # Subset to MIL == "ON"
    df_mil_on = df[df[mil_col] == "ON"]
    if not df_mil_on.empty:
        mil_on_mon = df_mon.loc[df_mil_on.index]
        mil_on_num = df_numeric.loc[df_mil_on.index]

        mil_on_results = []
        for num_col in mil_on_num.columns:
            for mon_col in monitoring_columns:
                if mon_col not in mil_on_mon.columns:
                    continue
                valid_data = df_mil_on[[num_col]].join(mil_on_mon[mon_col]).dropna()
                if len(valid_data) < 2:
                    continue
                if valid_data[num_col].nunique() < 2 or valid_data[mon_col].nunique() < 2:
                    continue

                r, p = pointbiserialr(valid_data[num_col], valid_data[mon_col])
                mil_on_results.append({
                    "Numeric Column": num_col,
                    "Monitor": mon_col,
                    "Point-Biserial r": r,
                    "p-value": p
                })
        mil_on_df = pd.DataFrame(mil_on_results)


# --------------- HELPER FUNCTIONS FOR STYLING --------------------
def style_pearson_corr(df: pd.DataFrame):
    """
    Color-code a DataFrame (PearsonCorr) in [-1, +1].
    """
    return df.style.background_gradient(cmap='RdYlGn', vmin=-1, vmax=1)

def style_pointbiserial_df(df: pd.DataFrame):
    """
    Sort DataFrame by 'Point-Biserial r' ascending,
    then color gradients for 'Point-Biserial r' & 'p-value'.
    """
    if df.empty:
        return df.style
    if "Point-Biserial r" not in df.columns or "p-value" not in df.columns:
        return df.style

    df_sorted = df.sort_values("Point-Biserial r", ascending=True)
    styler = (
        df_sorted.style
        .background_gradient(cmap="RdYlGn", subset=["Point-Biserial r"], vmin=-1, vmax=1)
        .background_gradient(cmap="RdYlGn_r", subset=["p-value"], vmin=0, vmax=1)
    )
    return styler

# --------------------- WRITE RESULTS TO EXCEL ---------------------
with pd.ExcelWriter(output_file, engine='openpyxl') as writer:
    # 1) Raw Data tab
    df.to_excel(writer, sheet_name="RawData", index=False)

    # 2) PearsonCorrMatrix
    pearson_styler = style_pearson_corr(corr_matrix)
    pearson_styler.to_excel(writer, sheet_name="PearsonCorrMatrix")

    # 3) Global point-biserial correlation
    global_styler = style_pointbiserial_df(results_df)
    global_styler.to_excel(writer, sheet_name="PointBiserialResults", index=False)

    # 4) Each top-10 DTC
    dtc_sheet_names = []
    for dtc_val in top_dtc_values:
        sheet_name = str(dtc_val)[:31].replace("/", "_")
        dtc_sheet_names.append(sheet_name)

        df_dtc = dtc_sheets[dtc_val]
        if df_dtc.empty:
            empty_df = pd.DataFrame(columns=["Numeric Column","Monitor","Point-Biserial r","p-value"])
            empty_df.to_excel(writer, sheet_name=sheet_name, index=False)
        else:
            dtc_styler = style_pointbiserial_df(df_dtc)
            dtc_styler.to_excel(writer, sheet_name=sheet_name, index=False)

    # 5) MIL=OFF correlation
    mil_off_sheet = "MIL_OFF"
    if mil_off_df.empty:
        pd.DataFrame(columns=["Numeric Column","Monitor","Point-Biserial r","p-value"])\
          .to_excel(writer, sheet_name=mil_off_sheet, index=False)
    else:
        mil_off_styler = style_pointbiserial_df(mil_off_df)
        mil_off_styler.to_excel(writer, sheet_name=mil_off_sheet, index=False)

    # 6) MIL=ON correlation
    mil_on_sheet = "MIL_ON"
    if mil_on_df.empty:
        pd.DataFrame(columns=["Numeric Column","Monitor","Point-Biserial r","p-value"])\
          .to_excel(writer, sheet_name=mil_on_sheet, index=False)
    else:
        mil_on_styler = style_pointbiserial_df(mil_on_df)
        mil_on_styler.to_excel(writer, sheet_name=mil_on_sheet, index=False)

# ---------- ADD LEGEND LINES TO CORRELATION & DTC SHEETS ----------
wb = load_workbook(output_file)

# Build a list of sheets to add the color legend to
correlation_sheets = [
    "PearsonCorrMatrix",
    "PointBiserialResults",
    mil_off_sheet,
    mil_on_sheet
] + dtc_sheet_names

for sheet_name in correlation_sheets:
    if sheet_name in wb.sheetnames:
        ws = wb[sheet_name]
        next_row = ws.max_row + 2  # a couple rows below existing data

        if sheet_name == "PearsonCorrMatrix":
            # Legend for correlation matrix
            ws.cell(row=next_row, column=1).value = "Legend for PearsonCorrMatrix Color Scale:"
            ws.cell(row=next_row+1, column=1).value = "• Red = -1, Yellow = 0, Green = +1"
        else:
            # Legend for point-biserial
            ws.cell(row=next_row, column=1).value = "Legend for Point-Biserial Correlations:"
            ws.cell(row=next_row+1, column=1).value = "• 'Point-Biserial r': Red = -1 → Green = +1"
            ws.cell(row=next_row+2, column=1).value = "• 'p-value': Green = 0.0 → Red = 1.0"

wb.save(output_file)

print(f"Done! Correlation results (RawData, global, top 10 DTC, MIL=OFF, MIL=ON) have been written to '{output_file}' with color-coding and legends.")

Key Steps in the Script

Raw Data Tab
- The script first writes the entire raw dataset into a RawData sheet in Excel for easy reference.
Filtering Monitors
- Any monitor column that is 'NotSupported' more than 90% of the time is removed, reducing clutter.
Binary Conversion
- Each monitor is mapped to 1 if “Complete,” 0 if “NotComplete,” and NaN otherwise. This lets us do Point-Biserial correlations.
Pearson Correlation Matrix (Entire Dataset)
- df.corr() across all numeric columns (including newly added 0/1 binary columns).
- The matrix is color-coded from -1 (red) to +1 (green).
Point-Biserial Correlation (Entire Dataset)
- For numeric vs. binary columns, we compute the correlation coefficient r and p-value (statistical significance).
- The script sorts results by Point-Biserial r, color-codes them, then writes them to a PointBiserialResults sheet.
Top 10 DTC Tabs
- We identify the ten most common DTCs.
- For each DTC, we subset the data and compute Point-Biserial correlations again, giving us a more focused look at how freeze-frame parameters relate to monitor statuses specifically for that DTC.
MIL OFF and MIL ON Tabs
- The script also subsets data where the “MIL” (Malfunction Indicator Lamp) is explicitly 'OFF' or 'ON'.
- This helps see if monitors behave differently when the dash light is on vs. off.
Adding a Legend
- Finally, it reopens the Excel file via openpyxl and inserts a short text-based legend in each correlation sheet:
  - PearsonCorrMatrix: “Red = -1, Green = +1”
  - Point-Biserial: “Red = -1 → Green = +1” for r, and “Green = 0 → Red = 1” for p-value.

Why Pearson and Point-Biserial?

Pearson: Measures how two numeric variables (e.g., Engine Speed vs. Vehicle Speed) move together.
Point-Biserial: Measures the relationship between a numeric variable (like coolant temperature) and a binary variable (like a monitor’s complete/incomplete status).

Because vehicle data often mixes numeric parameters with yes/no statuses, combining these two correlations gives a broad view of how sensor readings align with monitor readiness or with each other.

Conclusion

This workflow provides a convenient, color-coded Excel file that highlights:

Raw freeze-frame data
Overall correlation among numeric columns
Numeric vs. binary monitor correlations, both globally and filtered by DTC or MIL status

With these insights, engineers or data analysts can quickly see which conditions or sensor readings align with certain monitor states or DTC triggers, enabling more targeted troubleshooting or predictive analytics in automotive diagnostics.