How do you test SAP’s RPT-1 model beyond a simple Hello World example? Its claim to fame is that it can replace traditional machine learning models for classification and regression tasks. No model training, no hyperparameter tuning, just show it your data and it predicts. That’s a bold promise that deserves testing!
Back when I started learning about machine learning, one of my first projects was tackling Kaggle’s Titanic Challenge, in which you get a dataset with information about the passengers of the Titanic and have to predict which of them survived. It’s not a classic SAP use case, but I think it makes a good first benchmark: to score well, you typically need to clean data, engineer features, and iterate on your model. Would RPT-1 perform well on raw data? Could I improve results by tuning the input? To find out, I decided to board the Titanic again. Here’s what I discovered.
Spoiler Alert: RPT-1 performed well, and I learned a lot about how it works, including a bit of mystery and surprise. If you are looking for technical details, I have posted a Coding Walkthrough on GitHub. In this blog post, I will focus on the insights and lessons learned. So let’s jump right in.
How RPT-1 works
If you haven’t worked with RPT-1 before, here’s the basic idea: It uses in-context learning, similar to how Large Language Models work. Instead of training a model on your data, you show it examples, and it learns the patterns on the fly without any traditional training step. Think of it like showing a child examples before asking them to solve a new problem.
For the Titanic challenge, this means we give RPT-1 a table of passengers where we know if they survived or not. Then we add one or more rows for passengers whose survival is unknown, masking the “Survived” fields with a [PREDICT] placeholder. The model then looks at the patterns in the known data and fills in the blanks.
Here’s what a simplified payload looks like (for the full payload structure, please refer to the Coding Walkthrough):
```json
{
  "prediction_config": {
    "target_columns": [{
      "name": "Survived",
      "prediction_placeholder": "[PREDICT]"
    }]
  },
  "index_column": "PassengerId",
  "rows": [
    {"PassengerId": 1, "Pclass": 3, "Sex": "male", "Age": 22, "Survived": "0"},
    {"PassengerId": 2, "Pclass": 1, "Sex": "female", "Age": 38, "Survived": "1"},
    {"PassengerId": 3, "Pclass": 3, "Sex": "female", "Age": 26, "Survived": "[PREDICT]"}
  ]
}
```

The model sees that passenger 1 (male, 3rd class) didn’t survive, passenger 2 (female, 1st class) did. Now it can use these patterns (women and higher classes survived more often) to predict passenger 3. Pretty elegant, right?
Here is an example of what the prediction result looks like:
```json
[
  {
    "PassengerId": 3,
    "Survived": [
      {
        "confidence": 0.89,
        "prediction": "1"
      }
    ]
  }
]
```

RPT-1 therefore predicts that passenger 3 survived, with a confidence of 89%. A few things stand out here: First, you didn’t have to train a model to get this result. Second, that confidence score is actually meaningful. Sure, you could ask an LLM “how confident are you?” and it would happily give you a number, but that’s just another token prediction, making up a plausible-sounding answer. RPT-1’s confidence comes from the actual probability distribution over possible values produced by the model, which is a fundamentally different (and more useful) thing.
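To make this concrete, here is a minimal sketch (plain Python, assuming the response shown above has already been received as JSON) of how you might read out the prediction and its confidence:

```python
import json

# Response payload as returned by RPT-1 (structure as shown above)
response = json.loads("""
[
  {"PassengerId": 3, "Survived": [{"confidence": 0.89, "prediction": "1"}]}
]
""")

# Pull out the predicted value and its confidence for each requested row
for row in response:
    best = row["Survived"][0]
    print(f"Passenger {row['PassengerId']}: "
          f"Survived={best['prediction']} (confidence {best['confidence']:.0%})")
```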
Experimenting with Training Data
Since Kaggle limits you to 10 submissions per day and I wanted to run far more experiments than that, I initially planned to work only on the training data, where we know the ground truth: mask some “Survived” values and see if RPT-1 could predict them. This way I could measure performance without burning through my daily submission quota.
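As a rough illustration, the masking step might look like this with pandas (a sketch under my assumptions about file and column names; the actual code is in the Coding Walkthrough):

```python
import pandas as pd

# Load Kaggle's training data (891 rows with known outcomes)
train = pd.read_csv("train.csv")

# Randomly pick 400 rows, remember their true labels, then mask them
masked = train.sample(n=400, random_state=42)
ground_truth = masked.set_index("PassengerId")["Survived"].astype(str)

rows = train.copy()
rows["Survived"] = rows["Survived"].astype(str)
rows.loc[masked.index, "Survived"] = "[PREDICT]"

# "rows" becomes the rows section of the RPT-1 payload; accuracy is then
# the share of masked rows whose prediction matches ground_truth.
```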
Technically, my plan worked perfectly. The results, however, were… suspicious. When I masked 400 out of 891 rows and asked RPT-1 to fill them in, it predicted with 98% accuracy. To put that in perspective: the simple baseline of “women survive, men don’t” gets you about 76%. Anything above 80% is highly competitive, and no serious Kaggle submission breaks 90%. A traditional Gradient Boosting model I trained on the same data scored around 80%. So either RPT-1 is doing something magical, or something fishy is going on.
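For reference, that gender-only baseline is trivial to reproduce as a Kaggle submission file (a minimal sketch):

```python
import pandas as pd

# The classic "women survive, men don't" rule as a Kaggle submission
test = pd.read_csv("test.csv")
baseline = pd.DataFrame({
    "PassengerId": test["PassengerId"],
    "Survived": (test["Sex"] == "female").astype(int),
})
baseline.to_csv("gender_baseline.csv", index=False)
```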
My first thought: Maybe RPT-1 has seen the Titanic dataset during training and is simply regurgitating the answers? To test this, I tried predicting passenger names instead of survival. Names can’t be inferred from the other columns, so if RPT-1 had memorized the dataset, it should nail them too - but it didn’t. Instead, it predicted names from the unmasked rows in its context window, with very low confidence. A similar test with ticket numbers (which do have some patterns, since families shared tickets) showed modest results. RPT-1 was clearly pattern-matching from context, not memorizing from training.
But that still didn’t explain the 98% accuracy on survival predictions. So I ran one more experiment: What happens when I severely limit the context? I kept only 2 unmasked rows (the minimum allowed) and tried three variations:
- One survivor, one non-survivor in context → reasonable predictions
- Two survivors in context → predicted everyone survived
- Two non-survivors in context → predicted everyone died
The result was reassuring: RPT-1 only predicted values it had seen in the context window (just as we saw with names and ticket numbers). And when both “0” and “1” were present in the context, it “resisted” producing a near-perfect result, scoring 68%, even though having both labels available would have opened the door for regurgitation.
When I tested how much context RPT-1 needed to produce competitive results, it already scored 85% on the test data with only 20 unmasked entries in the context window (surprisingly high!). With only 20 examples out of 891, the model shouldn’t have enough signal to generalize that well and outperform a Gradient Boosting model trained on 491 examples. Put simply, it was doing too well. Why? Honestly, I still don’t fully understand it. The mystery remains unsolved. If you have a theory, I’d love to hear it. But rather than speculate further, I decided to move on to what really matters: How would RPT-1 perform when it actually had to predict unknown outcomes (the test set)?
The Real Test: Kaggle Submissions
Time to put RPT-1 to the real test. I combined the training data (with known outcomes) and test data (with [PREDICT] placeholders), sent it to RPT-1, and submitted the predictions to Kaggle.
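In rough terms, building that combined payload looks something like this (a simplified sketch assuming the standard Kaggle train.csv and test.csv files; the full code is in the Coding Walkthrough):

```python
import pandas as pd

train = pd.read_csv("train.csv")   # 891 rows, Survived known
test = pd.read_csv("test.csv")     # 418 rows, Survived unknown

train["Survived"] = train["Survived"].astype(str)
test["Survived"] = "[PREDICT]"
combined = pd.concat([train, test], ignore_index=True)

payload = {
    "prediction_config": {
        "target_columns": [{
            "name": "Survived",
            "prediction_placeholder": "[PREDICT]",
        }]
    },
    "index_column": "PassengerId",
    # Convert NaN to None so the payload serializes to clean JSON
    "rows": combined.astype(object)
                    .where(combined.notna(), None)
                    .to_dict(orient="records"),
}
```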
With raw data and zero preprocessing, RPT-1 scored 0.76555, matching the gender-based baseline. If you’ve tried this competition yourself, you know that’s actually not trivial. Many first attempts fall short of even this baseline.
Next, I tried some standard feature engineering: Extracting titles from names, creating family size indicators, binning ages. Surprisingly, the score barely budged (0.76794). In fact, adding just the Title feature actually dropped the score to 0.76076. This suggested RPT-1 was already extracting similar signals from the raw data, or perhaps the additional columns were adding noise.
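For reference, these standard features look roughly like this (a sketch; the exact title groupings and age bins in my notebook may differ):

```python
import pandas as pd

def add_basic_features(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    # Title from the name, e.g. "Braund, Mr. Owen Harris" -> "Mr"
    df["Title"] = df["Name"].str.extract(r",\s*([^.]+)\.")
    # Family size = siblings/spouses + parents/children + the passenger
    df["FamilySize"] = df["SibSp"] + df["Parch"] + 1
    # Coarse age bins instead of raw ages
    df["AgeBin"] = pd.cut(df["Age"], bins=[0, 12, 18, 40, 60, 100],
                          labels=["child", "teen", "adult", "middle", "senior"])
    return df
```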
Knowing that a higher score should be achievable, I created more advanced features that capture social context, for example the survival rates of passengers sharing the same surname or ticket number. Indeed, the score increased significantly to 0.78229. But how does this compare to traditional machine learning approaches? Let’s find out.
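Conceptually, such group features can be derived like this (a simplified sketch building on the combined train-plus-test frame from above; note that for training rows the group rate includes the passenger’s own label, a simplification over a leave-one-out computation):

```python
import pandas as pd

def add_group_survival_features(combined: pd.DataFrame) -> pd.DataFrame:
    """combined = train + test rows; Survived is '0'/'1' for train, '[PREDICT]' for test."""
    df = combined.copy()
    df["Surname"] = df["Name"].str.split(",").str[0]

    # Survival rates computed only from rows with known outcomes
    known = df[df["Survived"].isin(["0", "1"])].copy()
    known["SurvivedNum"] = known["Survived"].astype(int)

    for group_col, feature in [("Surname", "SurnameSurvivalRate"),
                               ("Ticket", "TicketSurvivalRate")]:
        rates = known.groupby(group_col)["SurvivedNum"].mean()
        df[feature] = df[group_col].map(rates)
    return df
```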
How Does This Compare to Traditional ML?
To compare RPT-1’s scores (small and large model) with traditional ML, I trained Gradient Boosting models along the way, using the same engineered features we gave to RPT-1. Here is the full result table for the different setups:
| Setup | RPT-1-Small | RPT-1-Large | GradientBoosting |
|---|---|---|---|
| No feature engineering | 0.76555 | 0.76555 | — |
| With title feature | 0.76076 | 0.76076 | — |
| Full feature engineering | 0.76794 | 0.76794 | 0.76555 |
| Advanced feature engineering | 0.77990 | 0.78229 | 0.77751 |
Interestingly, both model sizes scored identically on the simpler setups, with RPT-1-Large only pulling ahead once advanced features were added. This may indicate that the small model is good enough for a dataset of ~1,000 rows, but that clearly needs more testing with different use cases. Overall, RPT-1-Large wins: even if it’s only by ~0.3% compared to RPT-1-Small and ~0.5% compared to Gradient Boosting, that is a real difference in this competition.
But the scores only tell half the story. The more interesting part is how RPT-1 wins. When I first trained Gradient Boosting with the advanced features, it actually scored lower than the baseline — classic overfitting. I had to add cross-validation, tune hyperparameters, and simplify the feature set to get to 0.77751. That’s the normal ML workflow: train, evaluate, iterate, repeat.
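For context, the Gradient Boosting side of that workflow follows the familiar scikit-learn routine, roughly like this (a sketch with a minimal feature set and parameter grid; my actual setup differs):

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Minimal feature set for illustration (the real notebook uses the engineered features)
train = pd.read_csv("train.csv")
X = pd.get_dummies(train[["Pclass", "Sex", "Age", "Fare"]], columns=["Sex"]).fillna(-1)
y = train["Survived"]

# Cross-validated hyperparameter search to rein in overfitting
grid = GridSearchCV(
    GradientBoostingClassifier(random_state=42),
    param_grid={
        "n_estimators": [100, 200],
        "max_depth": [2, 3],
        "learning_rate": [0.05, 0.1],
    },
    cv=5,
    scoring="accuracy",
)
grid.fit(X, y)
print("Best CV accuracy:", grid.best_score_)
print("Best parameters:", grid.best_params_)
```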
With RPT-1, I could just add the new features and get results. No retraining, no hyperparameter tuning, no overfitting headaches. The same features that caused Gradient Boosting to overfit worked fine for RPT-1 out of the box.
This advantage compounds in real-world scenarios. Imagine your customer data changes weekly. With traditional ML, that’s weekly retraining jobs, validation runs, and deployment cycles. With RPT-1, you just update the context. Whether this holds for enterprise use cases remains to be tested, but the potential is significant and it would fundamentally simplify the operational model.
Conclusion
RPT-1 didn’t just survive the Titanic test, it came out a winner. What can we learn from this exercise? I think we need to look at it from two angles: how RPT-1 compares to traditional ML models, and how it compares to LLMs.
SAP positions RPT-1 as a replacement for traditional ML models. And it does deliver on that promise. Without any training, it matched and slightly outperformed a tuned Gradient Boosting model. That’s remarkable. You can throw structured data at it and get competitive predictions without a typical Machine Learning workflow: No model selection, no hyperparameter tuning, no train-test splits to manage.
Feature engineering still matters, but differently. Basic features like extracting titles barely moved the needle; RPT-1 already seemed to pick up these patterns from the raw data. But features capturing deeper structure (like group survival rates in this case) made a real difference. Zooming out, this saves time and effort when implementing a use case, but meaningful features that capture non-obvious domain insights are still worth engineering.
The overfitting episode also revealed a workflow advantage. To really make use of the new features with Gradient Boosting, I had to iterate: add cross-validation, tune parameters, simplify the feature set. With RPT-1, the same features just worked. This is nice for quick experiments (like Titanic), but projected forward into production systems that need regular updates with new data, it’s potentially transformative.
SAP also calls RPT-1 a foundation model (for tabular data), so how does it compare to an LLM? As we have seen, its output is quite different from that of an LLM, and it is worth noting that, unlike LLMs, RPT-1 produces deterministic results: you can run the same payload multiple times and get the same result. This is a big advantage over LLMs, where you typically get slightly different answers. Additionally, RPT-1 produces real confidence scores, something an LLM cannot do. Both of these points clearly make RPT-1 a different kind of model, giving you LLM-style in-context learning, but with ML-style predictive results.
The 98% mystery on masked training data remains. If you have a theory, I’d love to hear it. Nonetheless, my overall impression of RPT-1 is very good and I am looking forward to moving beyond the small Titanic dataset to more enterprise-like use cases to see how RPT-1 performs on more complex scenarios. What enterprise scenarios would you like to see tested?