Flights On Time

Technical Manual and Project Reference Document

FlightOnTime is a flight delay prediction system that uses historical data and machine learning techniques to anticipate whether a flight will depart on time or with a significant delay. The project stems from a specific need in the aviation industry: delays not only affect the passenger experience but also generate operational inefficiencies, extra costs, and planning issues that could be mitigated if the risk were known in advance.

The main objective of the system is not to retrospectively explain why a delay occurred, but to <strong>estimate the risk of it occurring before takeoff</strong>, when preventive decisions can still be made. To achieve this, FlightOnTime combines a robust Data Science model with an architecture designed for real production use, exposing predictions through a REST API.

This approach turns the project into a complete solution that goes beyond exploratory analysis or experimental modeling, positioning itself as a practical tool to support decision-making.

From a Data Science perspective, the problem was formulated as a <strong>binary classification problem</strong>, where the model must predict which of the following two classes a flight belongs to:

  • On-time flight
  • Flight with a delay equal to or greater than 15 minutes

The target variable used is DEP_DEL15, widely used in aeronautical datasets, which ensures consistency with real industry standards. This definition allows the prediction to have a clear operational meaning that is directly interpretable by both technical and non-technical users.
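The label definition can be illustrated with a minimal sketch. It assumes the departure delay is available in minutes (as in the DEP_DELAY column); the function name `dep_del15` is hypothetical. DEP_DELAY is used here only to define the label and is never passed to the model as a feature.

```python
# Minimal sketch of the DEP_DEL15 label: 1 when the departure delay is
# 15 minutes or more, 0 otherwise. The function name is hypothetical.
DELAY_THRESHOLD_MINUTES = 15


def dep_del15(dep_delay_minutes: float) -> int:
    """Return 1 for a delay of 15 minutes or more, else 0."""
    return int(dep_delay_minutes >= DELAY_THRESHOLD_MINUTES)


# Early departures have negative delays and count as on time.
labels = [dep_del15(d) for d in (-3.0, 0.0, 14.0, 15.0, 42.0)]
print(labels)  # [0, 0, 0, 1, 1]
```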

The main challenge lies in the high variability of the air system: multiple factors interact simultaneously, from weather conditions to specific operational patterns of airlines, airports, and time slots. Therefore, the project's approach prioritizes models capable of capturing complex relationships without sacrificing robustness.

The model was trained using a large-scale dataset composed of approximately <strong>35.7 million US domestic flights</strong>. This volume of data allows for the capture of structural patterns in the aviation system that would not be visible in small or limited datasets.

In addition to basic flight information, the project incorporates a <strong>data enrichment</strong> process that integrates external variables relevant to delay prediction. Specifically, historical weather information and geolocation data were added, obtained through specialized APIs and sources that were evaluated for reliability.

This enrichment is one of the project's main strengths, as it introduces factors that directly influence flight punctuality and are often omitted in simplified approaches.

The final set of features used by the model was carefully selected to meet two fundamental criteria: predictive relevance and pre-flight availability.

Temporal Variables

These variables allow for capturing seasonality and hourly patterns:

  • Year (YEAR)
  • Month (MONTH)
  • Day of the week (DAY_OF_WEEK)
  • Scheduled minute of the day (sched_minute_of_day)
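As an illustration, the temporal variables above could be derived from a scheduled departure timestamp roughly as follows. This is a sketch, not the project's actual transformation code: the function name `temporal_features` is hypothetical, and the day-of-week encoding (1 = Monday through 7 = Sunday) is an assumption.

```python
from datetime import datetime


def temporal_features(scheduled_departure: datetime) -> dict:
    """Derive the temporal features from a scheduled departure time."""
    return {
        "YEAR": scheduled_departure.year,
        "MONTH": scheduled_departure.month,
        # Assumed encoding: 1 = Monday ... 7 = Sunday.
        "DAY_OF_WEEK": scheduled_departure.isoweekday(),
        # Minutes elapsed since midnight, from 0 to 1439.
        "sched_minute_of_day": scheduled_departure.hour * 60
        + scheduled_departure.minute,
    }


features = temporal_features(datetime(2024, 7, 15, 18, 45))
print(features)
```

Encoding the departure time as a single minute-of-day value keeps the hourly pattern in one numeric feature instead of separate hour and minute columns.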

Operational Variables

These variables represent characteristics specific to the flight:

  • Airline (OP_UNIQUE_CARRIER)
  • Origin airport (ORIGIN)
  • Destination airport (DEST)
  • Flight distance (DISTANCE, DIST_MET_KM)

Weather Variables

These variables introduce the impact of the meteorological environment:

  • Temperature (TEMP)
  • Wind speed (WIND_SPD)
  • Hourly precipitation (PRECIP_1H)
  • Climate severity index (CLIMATE_SEVERITY_IDX)

Discarded Variables

Variables that could generate data leakage, were not available in real-time, or did not provide clear predictive value were removed, such as:

  • DEP_DELAY
  • FL_DATE
  • CRS_DEP_TIME
  • City names instead of standardized codes

This refinement keeps the model realistic and production-oriented: it relies only on information that is available before the flight departs.

After evaluating different approaches, the final model selected was XGBoost, a gradient boosting algorithm widely used in classification problems with tabular data.

The choice of XGBoost is based on its ability to:

  • Capture complex non-linear relationships
  • Handle interactions between multiple variables
  • Scale efficiently with large volumes of data
  • Maintain excellent predictive performance

The model was trained, evaluated, and optimized, reaching the following metrics:

  • Accuracy: 72.32%
  • Recall: 54.30%
  • ROC-AUC: 0.7194
  • Optimized Threshold: 0.5591

These metrics reflect a balance between overall accuracy and the ability to identify flights with a real risk of delay: at the optimized decision threshold, roughly 72% of flights are classified correctly and about 54% of delayed flights are detected before departure.
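To make the role of the threshold concrete, the sketch below shows how accuracy and recall can be computed once predicted probabilities are converted into classes. The labels and probabilities are toy values chosen for illustration, not project data, and the use of `>=` at the threshold boundary is an assumption.

```python
# Threshold value reported in this document.
OPTIMIZED_THRESHOLD = 0.5591


def classify(probabilities, threshold=OPTIMIZED_THRESHOLD):
    """Label a flight as delayed (1) when its probability reaches the threshold."""
    return [int(p >= threshold) for p in probabilities]


def accuracy(y_true, y_pred):
    """Share of flights whose predicted class matches the true class."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred):
    """Share of truly delayed flights that were predicted as delayed."""
    true_positives = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    positives = sum(y_true)
    return true_positives / positives if positives else 0.0


# Toy data for illustration only.
y_true = [0, 0, 1, 1, 0, 1]
y_prob = [0.10, 0.60, 0.70, 0.50, 0.30, 0.90]
y_pred = classify(y_prob)

print(y_pred)
print(f"accuracy={accuracy(y_true, y_pred):.2f} recall={recall(y_true, y_pred):.2f}")
```

Lowering the threshold would flag more flights as delayed, which raises recall at the cost of more false alarms; raising it has the opposite effect.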

FlightOnTime was designed with a modular and decoupled architecture, clearly separating the responsibilities of each component:

  • Data Science Layer: model training, evaluation, and serialization
  • Back-End Layer: model exposure via a REST API
  • Front-End Layer: API consumption by end users

The model was serialized using Joblib and dynamically loaded by the API developed with FastAPI. Communication between the Back-End and the model is based on a previously defined feature contract, which ensures consistency between training and prediction in production.
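One way to picture the feature contract is as a fixed, ordered list of column names that both the training code and the API share. The sketch below is an assumption about how such a contract might be enforced: the column names come from this document, but their order, the helper name `to_feature_row`, and the example values are illustrative.

```python
# Hypothetical feature contract: names are taken from this document,
# the ordering is assumed for illustration.
FEATURE_CONTRACT = (
    "YEAR", "MONTH", "DAY_OF_WEEK", "sched_minute_of_day",
    "OP_UNIQUE_CARRIER", "ORIGIN", "DEST", "DISTANCE", "DIST_MET_KM",
    "TEMP", "WIND_SPD", "PRECIP_1H", "CLIMATE_SEVERITY_IDX",
)


def to_feature_row(features: dict) -> list:
    """Return the feature values in contract order, rejecting any mismatch."""
    missing = [name for name in FEATURE_CONTRACT if name not in features]
    unexpected = [name for name in features if name not in FEATURE_CONTRACT]
    if missing or unexpected:
        raise ValueError(
            f"feature contract violated: missing={missing}, unexpected={unexpected}"
        )
    return [features[name] for name in FEATURE_CONTRACT]


# Illustrative values only; units are not specified here.
example = {
    "YEAR": 2024, "MONTH": 7, "DAY_OF_WEEK": 1, "sched_minute_of_day": 1125,
    "OP_UNIQUE_CARRIER": "AA", "ORIGIN": "JFK", "DEST": "LAX",
    "DISTANCE": 2475.0, "DIST_MET_KM": 3.2,
    "TEMP": 27.5, "WIND_SPD": 12.0, "PRECIP_1H": 0.0,
    "CLIMATE_SEVERITY_IDX": 0.1,
}
row = to_feature_row(example)
print(row)
```

Failing fast on a missing or unexpected column prevents the API from silently sending the model inputs that differ from what it saw during training.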

From the perspective of a user or client system, using the project is straightforward.

To request a prediction, basic flight data must be provided to the API, including:

  • Airline
  • Origin and destination airport
  • Scheduled departure date and time
  • Necessary information to reconstruct temporal and spatial variables

The API validates the information, applies the necessary transformations, and executes the predictive model.
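The validation step might look like the following sketch. The payload field names (`airline`, `origin`, `destination`, `scheduled_departure`), the three-letter airport code check, and the ISO 8601 timestamp format are all assumptions made for illustration; the actual API schema is not specified here.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class FlightRequest:
    """Hypothetical validated request; field names are assumptions."""
    airline: str
    origin: str
    destination: str
    scheduled_departure: datetime


def parse_request(payload: dict) -> FlightRequest:
    """Validate a raw payload and normalize codes to upper case."""
    required = ("airline", "origin", "destination", "scheduled_departure")
    missing = [key for key in required if key not in payload]
    if missing:
        raise ValueError(f"missing fields: {missing}")

    airline = str(payload["airline"]).strip().upper()
    origin = str(payload["origin"]).strip().upper()
    destination = str(payload["destination"]).strip().upper()
    if not airline:
        raise ValueError("airline must not be empty")
    # Assumed format: three-letter airport codes.
    for label, code in (("origin", origin), ("destination", destination)):
        if len(code) != 3 or not code.isalpha():
            raise ValueError(f"{label} must be a three-letter airport code")

    # fromisoformat raises ValueError on a malformed timestamp.
    departure = datetime.fromisoformat(str(payload["scheduled_departure"]))
    return FlightRequest(airline, origin, destination, departure)


request = parse_request({
    "airline": "aa",
    "origin": "jfk",
    "destination": "LAX",
    "scheduled_departure": "2024-07-15T18:45:00",
})
print(request)
```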

As a result, the system returns:

  • A classification: on-time flight or delayed flight
  • A probability associated with the delay

This probability should not be interpreted as an absolute certainty, but rather as a risk indicator. High values indicate that, given similar historical conditions, there is a high probability of delay. This allows for informed decision-making, such as reinforcing operational planning, anticipating communications, or prioritizing the monitoring of certain flights.
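A response carrying both elements could be assembled as in the sketch below. The response keys (`prediction`, `probability`) and the label strings are hypothetical; only the threshold value comes from this document.

```python
# Threshold value reported in this document.
OPTIMIZED_THRESHOLD = 0.5591


def build_response(delay_probability: float,
                   threshold: float = OPTIMIZED_THRESHOLD) -> dict:
    """Combine the class label and the delay probability in one payload."""
    if not 0.0 <= delay_probability <= 1.0:
        raise ValueError("delay_probability must be between 0 and 1")
    label = "Delayed" if delay_probability >= threshold else "On time"
    # Key names and label strings are assumptions for illustration.
    return {"prediction": label, "probability": round(delay_probability, 4)}


print(build_response(0.73))
print(build_response(0.31))
```

Returning the probability alongside the label lets a client system apply its own, stricter or looser, cut-off when deciding which flights to monitor.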

FlightOnTime demonstrates how Data Science generates real impact when integrated into usable systems. The project combines technical rigor, use of real data, robust modeling, and an architecture designed for production.

It does not promise to eliminate delays, but it does reduce uncertainty, which is one of the major operational problems in the aviation industry. In this sense, the project positions itself as a necessary, scalable solution aligned with real professional practices.