The F1 Score Obsession

When metrics matter and when they mislead.

Every ML engineer has been there. You present results from a classification model, and the first question is always: "What's the F1 score?" Not "What's the precision?" Not "What's the recall?" Not "How does this translate to business value?" Just F1.

The F1 score has become the default metric for classification problems. It appears in nearly every academic paper, every Kaggle competition, every model evaluation report. But this default choice is often wrong, sometimes catastrophically so. Understanding when F1 is appropriate and when it misleads is essential for building ML systems that work in production.

Why F1 Became the Default

The F1 score's dominance is understandable. It's a single number that summarizes model performance, making it easy to compare models and communicate results. For stakeholders unfamiliar with ML, "our F1 is 0.85" is more digestible than a confusion matrix or precision-recall curve.

Mathematically, F1 is the harmonic mean of precision and recall:

F1 = 2 * (precision * recall) / (precision + recall)

The harmonic mean punishes extreme values more than the arithmetic mean. An F1 of 0.9 requires both precision and recall to be reasonably high. You can't have 1.0 precision and 0.5 recall and claim good performance.

This balance between precision and recall makes F1 appealing. It prevents gaming the metric by optimizing for one at the expense of the other. A model with 1.0 precision and 0.3 recall has an F1 of only 0.46, exposing the poor recall immediately.
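
The arithmetic is easy to verify in a couple of lines of Python; the f1 helper below simply restates the formula above:

    def f1(precision, recall):
        # Harmonic mean of precision and recall.
        return 2 * precision * recall / (precision + recall)

    p, r = 1.0, 0.3
    print((p + r) / 2)   # arithmetic mean: 0.65, looks respectable
    print(f1(p, r))      # F1: ~0.46, the weak recall drags the score down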

Academic research further entrenched F1 as the standard. Papers need comparable baselines, and F1 provides a clean, consistent benchmark across datasets. Competitions like ImageNet and COCO popularized metrics like top-5 accuracy and mAP, but for binary and multi-class classification, F1 remained the gold standard.

The problem is that real-world applications rarely have equal costs for false positives and false negatives. And when they don't, F1 can be dangerously misleading.

When F1 Misleads

Class Imbalance

Consider fraud detection. You're building a model to identify fraudulent transactions in a payment processing system. The fraud rate is 0.1%, meaning 1 in 1,000 transactions is fraudulent.

Your model flags 105 out of 10,000 transactions and catches half of the actual fraud. Sounds reasonable, right? But look closer at the confusion matrix:

                          Predicted fraud    Predicted legitimate
    Actual fraud                   5 (TP)                  5 (FN)
    Actual legitimate            100 (FP)              9,890 (TN)

Your precision is 0.048 (5 out of 105 flagged transactions are actually fraud). Your recall is 0.5 (you caught 5 out of 10 fraudulent transactions). The F1 score of 0.087 reflects this poor performance, but the real issue is clearer when you consider the business impact: every flagged transaction requires manual review, and you're overwhelming your fraud team with false positives.
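
The same arithmetic in code, using the counts from the confusion matrix above:

    tp, fp, fn, tn = 5, 100, 5, 9_890    # counts from the confusion matrix above

    precision = tp / (tp + fp)                            # 5 / 105 ~= 0.048
    recall = tp / (tp + fn)                               # 5 / 10   = 0.5
    f1 = 2 * precision * recall / (precision + recall)    # ~= 0.087
    accuracy = (tp + tn) / (tp + fp + fn + tn)            # ~= 0.99, flattered by the imbalance
    print(precision, recall, f1, accuracy)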

With severe class imbalance, a high F1 doesn't guarantee useful predictions. A model that predicts "fraud" for everything achieves perfect recall but near-zero precision. A model that predicts "fraud" for nothing achieves perfect precision but zero recall. F1 tries to balance these extremes, but in highly imbalanced scenarios, it can obscure critical performance issues.

Cost Asymmetry

Even without class imbalance, F1 fails when false positives and false negatives have different costs. Medical diagnosis is the canonical example. A false negative (missing cancer) can be fatal. A false positive (flagging healthy tissue as cancerous) leads to additional testing, patient anxiety, and unnecessary biopsies, but it's not fatal.

A screening model with 0.95 recall and 0.70 precision might have an F1 of 0.81. A more conservative model with 0.85 recall and 0.80 precision has an F1 of 0.82. The second model has a "better" F1, but in a screening context, you likely prefer the first model that catches more cases, even if it generates more false positives. In reality, you'd use precision-recall curves and choose an operating point based on acceptable false positive rates, but this illustrates the fundamental issue: F1 assumes equal costs.

The same logic applies to spam detection, but in reverse. A false positive (legitimate email in spam) can mean missing critical business communications. A false negative (spam in inbox) is annoying but not catastrophic. Here you want high precision, even if recall suffers. F1 treats these scenarios identically.

Precision vs Recall Trade-Offs in Production

The precision-recall trade-off is fundamental to classification. Lowering your classification threshold increases recall (you catch more positive cases) but decreases precision (you also catch more false positives). Raising the threshold does the opposite.
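
A minimal sketch of that sweep using scikit-learn's precision_recall_curve; the labels and scores below are synthetic stand-ins for a real validation set:

    import numpy as np
    from sklearn.metrics import precision_recall_curve

    # Synthetic stand-ins for validation labels and model scores.
    rng = np.random.default_rng(42)
    y_true = rng.integers(0, 2, size=5_000)
    y_score = np.clip(0.35 * y_true + rng.normal(0.4, 0.2, size=5_000), 0, 1)

    precision, recall, thresholds = precision_recall_curve(y_true, y_score)

    # As the threshold rises, recall falls and precision (generally) rises.
    for t in (0.3, 0.5, 0.7):
        i = np.searchsorted(thresholds, t)
        print(f"threshold {t:.1f}: precision {precision[i]:.2f}, recall {recall[i]:.2f}")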

In production, this trade-off is determined by business constraints, not by maximizing F1. Consider adverse media screening for anti-money laundering (AML) compliance. You're building a system to flag individuals and entities mentioned in news articles about financial crimes, corruption, sanctions violations, and other illicit activities.

Regulations require financial institutions to conduct ongoing monitoring of customers and counterparties. Missing a sanctioned entity or a politically exposed person (PEP) involved in corruption can result in regulatory fines, reputational damage, and legal liability. The cost of a false negative is high.

False positives, on the other hand, lead to manual review by compliance analysts. Too many false positives overwhelm the team, but missing high-risk entities is worse. In this context, you optimize for high recall, accepting lower precision as a necessary trade-off. You might target 0.95 recall with 0.60 precision, yielding an F1 of 0.74. A model with F1 of 0.80 but only 0.85 recall is less valuable, even though its F1 is higher.
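
One way to encode that rule is to pick the highest-precision threshold that still clears the recall floor, rather than the threshold that maximizes F1. A sketch, assuming y_true and y_score come from a labeled validation set:

    import numpy as np
    from sklearn.metrics import precision_recall_curve

    def pick_threshold(y_true, y_score, recall_floor=0.95):
        # Maximize precision subject to recall >= recall_floor.
        precision, recall, thresholds = precision_recall_curve(y_true, y_score)
        precision, recall = precision[:-1], recall[:-1]   # last point has no threshold
        feasible = recall >= recall_floor
        if not feasible.any():
            raise ValueError("no threshold satisfies the recall floor")
        best = np.argmax(np.where(feasible, precision, -1.0))
        return thresholds[best], precision[best], recall[best]

    # Usage on a validation set:
    # threshold, p, r = pick_threshold(y_val, y_val_scores, recall_floor=0.95)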

Contrast this with a recommendation system for e-commerce. False positives (recommending products the user doesn't want) are mildly annoying. False negatives (not showing products the user would buy) are missed revenue opportunities. But precision matters more here: showing irrelevant recommendations erodes trust and engagement. You'd likely optimize for precision, accepting lower recall. An F1-optimized model might perform worse on the metrics that matter for user experience and revenue.

Domain-Specific Metric Needs

Different domains have evolved their own metrics that better capture the nuances of their problems. Understanding these metrics reveals the limitations of F1 and provides alternatives for specific use cases.

Compliance: High Recall, Bounded Precision

Compliance domains like AML, know-your-customer (KYC), and sanctions screening prioritize recall. Missing a sanctioned entity or failing to detect money laundering can result in multi-million-dollar fines and regulatory enforcement actions. False positives are costly in terms of analyst time, but false negatives are existential risks.

In practice, compliance systems aim for recall above 0.95, with precision targets set by operational capacity. If your compliance team can review 1,000 alerts per day and you see 10,000 transactions a day, your false positive rate determines whether you overwhelm the team. You optimize for maximum recall subject to a precision constraint, not for F1.
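
The review-capacity constraint can be turned into a threshold rule the same way: flag only as many transactions as the team can review, then measure the recall you get at that cutoff. A rough sketch, reusing the illustrative 10,000-transaction, 1,000-alert figures above:

    import numpy as np

    def alert_budget_cutoff(y_score, daily_volume=10_000, review_capacity=1_000):
        # Lowest cutoff whose expected alert volume still fits the review budget.
        max_alert_rate = review_capacity / daily_volume        # here: top 10% of scores
        return np.quantile(y_score, 1.0 - max_alert_rate)

    # Then report the recall you actually get at that operating point:
    # cutoff = alert_budget_cutoff(y_val_scores)
    # flagged = y_val_scores >= cutoff
    # recall = (flagged & (y_val == 1)).sum() / (y_val == 1).sum()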

Spam Detection: High Precision Required

Spam filters need high precision. A legitimate email sent to spam (false positive) can mean missing job offers, business inquiries, or important personal messages. Spam in the inbox (false negative) is annoying but not catastrophic.

Modern spam filters achieve precision above 0.99, often at the cost of lower recall. Users tolerate occasional spam far better than missing important emails. Gmail's spam filter, for instance, optimizes for precision, using multiple signals (sender reputation, content analysis, user behavior) to minimize false positives. F1 would encourage more aggressive filtering, but that would degrade user experience.

Object Detection: mAP Over F1

Object detection models like YOLO, Faster R-CNN, and EfficientDet use mean Average Precision (mAP) instead of F1. mAP computes precision-recall curves for each class, calculates the average precision (area under the curve), and averages across classes.

Why not F1? Object detection involves both classification (what is the object?) and localization (where is it?). A prediction is correct only if the class is right and the bounding box overlaps sufficiently with the ground truth (typically Intersection over Union > 0.5). mAP captures both aspects and provides a single metric that summarizes performance across all classes and IoU thresholds.
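
The IoU test at the heart of that matching rule is easy to write down; a minimal version for axis-aligned boxes in (x1, y1, x2, y2) form:

    def iou(box_a, box_b):
        # Boxes given as (x1, y1, x2, y2); returns intersection-over-union in [0, 1].
        ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
        ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
        area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
        return inter / (area_a + area_b - inter)

    # A detection only counts as a true positive if the class matches and the
    # IoU with the ground-truth box clears the threshold (0.5 for mAP@0.5).
    print(iou((0, 0, 10, 10), (5, 0, 15, 10)))   # 0.33...: the overlap is half of each box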

The COCO dataset uses mAP@[0.5:0.95], averaging mAP over IoU thresholds from 0.5 to 0.95 in 0.05 increments. This rewards models that produce tight, accurate bounding boxes, not just correct classifications. F1 doesn't naturally extend to this setting without arbitrary choices about how to handle localization errors.

Business Metrics vs ML Metrics

The most important distinction is between ML metrics and business metrics. F1, precision, recall, and accuracy are proxies for what you actually care about: revenue, user engagement, cost savings, risk reduction, or regulatory compliance.

A fraud detection model with F1 of 0.7 that saves the company 10 million dollars annually is better than a model with F1 of 0.8 that saves 5 million. A recommendation model with precision of 0.6 that increases click-through rate by 20% is better than a model with precision of 0.7 that increases click-through rate by 10%. This assumes the models have comparable recall. If the 0.6 precision model achieves higher CTR by recommending more items (higher recall), you're not comparing like-for-like. Always consider the full precision-recall trade-off.

The challenge is that business metrics are harder to measure during model development. You can compute F1 on a validation set in minutes. Measuring revenue impact requires deploying the model, running A/B tests, and waiting for statistical significance. This makes F1 a useful development metric, but only if it correlates with the business metric you care about.

For some applications, the correlation is weak. Consider a content moderation system that removes toxic comments. The business goal is to improve user experience and reduce harassment. A model with high F1 on a toxicity dataset might still perform poorly in production if it fails to catch subtle harassment or over-moderates borderline content, driving away users.

The solution is to establish the relationship between ML metrics and business metrics early. If you're building a fraud detection model, estimate the cost of false positives (manual review time, customer friction) and false negatives (fraud losses, chargebacks). Use these costs to define a custom loss function or evaluation metric that aligns with business value. F1 is a starting point, not the endpoint.
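
As a sketch of what such a metric can look like, the snippet below scores a model by the total dollar cost of its errors; the per-error costs are placeholder assumptions, not real figures:

    import numpy as np

    # Placeholder cost estimates -- substitute your own figures.
    COST_FALSE_POSITIVE = 25.0    # e.g. analyst review time per wrongly flagged case
    COST_FALSE_NEGATIVE = 500.0   # e.g. average loss per missed fraud case

    def expected_cost(y_true, y_pred):
        # Total dollar cost of the model's errors on a labeled evaluation set.
        y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
        fp = int(((y_pred == 1) & (y_true == 0)).sum())
        fn = int(((y_pred == 0) & (y_true == 1)).sum())
        return fp * COST_FALSE_POSITIVE + fn * COST_FALSE_NEGATIVE

    # Lower is better; compare models (or thresholds) on cost rather than on F1 alone.
    print(expected_cost([0, 1, 1, 0, 1], [0, 1, 0, 1, 1]))   # 1 FP + 1 FN = 525.0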

A Practical Framework for Choosing Metrics

Given the limitations of F1, how do you choose the right metric? Here's a framework based on production ML systems across compliance, e-commerce, and industrial applications:

  1. Identify the business objective. What are you optimizing for? Revenue, cost reduction, user satisfaction, regulatory compliance? Be specific. "Improve fraud detection" is too vague. "Reduce fraud losses by 20% while keeping false positive rate under 1%" is actionable.
  2. Quantify the cost of errors. What's the cost of a false positive? What's the cost of a false negative? If you can't quantify these precisely, estimate ranges. Even rough estimates clarify whether precision or recall is more important.
  3. Check for class imbalance. If positive class frequency is below 5%, standard metrics like accuracy and F1 can be misleading. Consider precision-recall curves, area under the precision-recall curve (AUC-PR), or resampling techniques to balance the dataset (see the sketch after this list).
  4. Choose the appropriate metric. Based on the above, select a metric that aligns with the business objective. High-stakes false negatives? Optimize for recall. High-stakes false positives? Optimize for precision. Balanced costs? F1 is reasonable. Need to evaluate across multiple classes or thresholds? Consider mAP or AUC-PR.
  5. Validate with business metrics. After deployment, measure the business impact. Does higher F1 correlate with better business outcomes? If not, revisit your metric choice. This feedback loop is essential for long-term success.
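
On point 3: scikit-learn's average_precision_score estimates AUC-PR directly, and a quick synthetic experiment shows why it is a more imbalance-aware signal than accuracy (the 1% positive rate below is an illustrative assumption):

    import numpy as np
    from sklearn.metrics import accuracy_score, average_precision_score

    rng = np.random.default_rng(7)
    y_true = (rng.random(20_000) < 0.01).astype(int)   # ~1% positive class
    useless_scores = rng.random(20_000)                # a model that has learned nothing

    # Predicting "negative" for everything still yields ~99% accuracy...
    print(accuracy_score(y_true, np.zeros_like(y_true)))       # ~0.99
    # ...while AUC-PR for uninformative scores collapses to roughly the base rate.
    print(average_precision_score(y_true, useless_scores))     # ~0.01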

For most production systems, you'll end up optimizing for a metric other than F1. Compliance systems optimize for recall subject to precision constraints. Recommendation systems optimize for click-through rate or conversion rate. Fraud detection systems optimize for expected cost (fraud losses minus review costs). F1 is a tool, not a target.

Conclusion

F1 has its place. For balanced datasets with symmetric costs, it's a reasonable default. For comparing models in academic settings, it provides a standard benchmark. But in production, F1 is often the wrong metric.

The obsession with F1 reflects a broader tendency in ML to optimize for convenient metrics rather than meaningful ones. Accuracy, F1, and even AUC are all proxies. They're useful during development, but they're not the goal. The goal is to build systems that deliver business value, and that requires understanding the real-world costs and constraints of your application.

Next time someone asks for your F1 score, ask them what they're optimizing for. If the answer is "balance between precision and recall," ask why. If they can't articulate the cost of false positives versus false negatives, the conversation isn't about metrics. It's about understanding the problem. And that's where good ML work begins.