Fusion S

VoyageWorks' Synthetic Data



You’ve Got the Team—Now Get the Data

We heard you—you want to build models, refine pricing, and test ideas in-house, using your own team and tools. But to do that well, you need more than just your internal data. That’s why we created Fusion S, VoyageWorks' synthetic dataset: a high-quality, contributory dataset built to reflect real-world relationships between underwriting variables and loss outcomes.


It supplements your internal data with a broader industry view, giving you a credible foundation for segmentation, pricing experiments, and go-to-market strategies. The dataset includes detailed exposure and loss information, mirrors patterns observed across a large carrier-contributed pool, and is updated annually with fresh data from insurers like you.

Real answers about synthetic data

  • Why was Fusion S™ created?

    The goal was to enable carriers and partners to build models, test pricing strategies, and explore segmentation ideas without needing access to sensitive or proprietary data. The synthetic data mirrors the characteristics of the pooled dataset but is different enough to protect against reverse engineering.

  • How is Fusion S™ constructed?

    VoyageWorks’ synthetic data is designed to preserve the distribution of every variable in the dataset and the correlations between variables, especially between each predictor variable and frequency, severity, and pure premium. VoyageWorks tests extensively to confirm that the synthetic data preserves both, and that models built on the synthetic data closely match models built on the actual data.


  • What source data was used to create the synthetic dataset?

    We used the liability, collision, and comprehensive modeling datasets from the VoyageWorks pool. Records with missing liability incurred amounts, missing NAICS codes, or fewer than 0.25 earned car years were excluded. Third-party variables and premium fields were also removed.
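The exclusion rules above can be sketched as a simple record filter. This is a minimal illustration, not VoyageWorks' actual pipeline: the `PolicyRecord` type and its field names are hypothetical, and it assumes a missing value is represented as `None`.

```python
from dataclasses import dataclass
from typing import List, Optional

# Threshold taken from the text: records with fewer than 0.25 earned car years are excluded.
MIN_EARNED_CAR_YEARS = 0.25


@dataclass
class PolicyRecord:
    # Hypothetical field names, chosen for illustration only.
    liability_incurred: Optional[float]  # None means the amount is missing
    naics_code: Optional[str]            # None means the code is missing
    earned_car_years: float


def is_eligible(record: PolicyRecord) -> bool:
    """Return True if the record passes all three exclusion rules."""
    return (
        record.liability_incurred is not None
        and record.naics_code is not None
        and record.earned_car_years >= MIN_EARNED_CAR_YEARS
    )


def filter_records(records: List[PolicyRecord]) -> List[PolicyRecord]:
    """Keep only the records that would feed the synthetic data build."""
    return [r for r in records if is_eligible(r)]
```

Note that an incurred amount of zero is a valid value here and is kept; only a missing amount triggers exclusion.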

  • What is the state and industry distribution? 

    We can provide that information. Please reach out to us. 

  • How do you make sure the synthetic data behaves like the actual data?

    We ran multiple statistical tests, including:


    Variable distributions: Every variable in the synthetic data closely mirrors the distribution in the actual dataset.


    Numeric correlations: Relationships between numeric fields stayed highly consistent—no correlation changed by more than 10%.


    Categorical relationships: The frequency of category combinations (e.g., NAICS code + Vehicle Use) stayed within 2% of the actual data, with no unrealistic new combinations introduced.


    Box-and-whisker comparisons: Numeric variables were compared across levels of categorical variables to verify consistent patterns.


    Model behavior: We trained Random Forest and GBM models on both the real and synthetic datasets. Variable importance, actual vs. predicted values, and double lift charts all showed that the synthetic models behaved nearly identically.
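Two of the checks listed above, numeric correlations and categorical relationships, can be sketched in a few lines. This is an illustration under stated assumptions rather than the tests VoyageWorks runs: the function names are hypothetical, and the text does not say whether the 10% and 2% tolerances are absolute or relative, so this sketch simply returns the absolute differences and leaves the threshold to the caller.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric lists (assumes non-zero variance)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def max_correlation_shift(actual, synthetic):
    """Largest absolute change in pairwise correlation between the two datasets.

    `actual` and `synthetic` map a column name to a list of numbers and are
    assumed to share the same column names.
    """
    return max(
        abs(pearson(actual[a], actual[b]) - pearson(synthetic[a], synthetic[b]))
        for a, b in combinations(sorted(actual), 2)
    )


def combination_shares(rows):
    """Share of records for each category combination, e.g. (NAICS code, vehicle use)."""
    counts = Counter(rows)
    return {combo: count / len(rows) for combo, count in counts.items()}


def compare_combinations(actual_rows, synthetic_rows):
    """Return (largest absolute gap in share, combinations seen only in the synthetic data)."""
    actual = combination_shares(actual_rows)
    synthetic = combination_shares(synthetic_rows)
    new_combos = set(synthetic) - set(actual)
    max_gap = max(
        abs(actual.get(c, 0.0) - synthetic.get(c, 0.0))
        for c in set(actual) | set(synthetic)
    )
    return max_gap, new_combos
```

A non-empty `new_combos` set would flag category combinations that appear in the synthetic data but never in the actual data.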

  • How do customers validate the data to gain confidence in using it? 

    VoyageWorks builds one model on actual training data and another on synthetic training data, then scores both against the same actual holdout data. Across multiple tests, the results of the synthetic model are nearly identical to those of the actual model. For any model built on the synthetic data, VoyageWorks can score it against the real data and produce validation charts.
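The validation approach described above, training on synthetic data and testing on real data, can be sketched as follows. To keep the example self-contained, a simple segment-average model stands in for the Random Forest and GBM models mentioned earlier; the function names and the (segment, loss) row layout are assumptions made for illustration.

```python
from collections import defaultdict


def fit_segment_means(rows):
    """Fit a stand-in model: mean loss per segment, plus an overall mean as fallback.

    `rows` is a list of (segment, loss) pairs.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for segment, loss in rows:
        totals[segment] += loss
        counts[segment] += 1
    overall = sum(totals.values()) / sum(counts.values())
    means = {segment: totals[segment] / counts[segment] for segment in totals}
    return means, overall


def predict(model, segment):
    """Predicted loss for a segment; unseen segments fall back to the overall mean."""
    means, overall = model
    return means.get(segment, overall)


def mean_absolute_error(model, holdout):
    """Average absolute prediction error on (segment, loss) holdout rows."""
    return sum(abs(predict(model, s) - loss) for s, loss in holdout) / len(holdout)


def compare_on_holdout(actual_train, synthetic_train, actual_holdout):
    """Score an actual-trained and a synthetic-trained model on the same actual holdout."""
    actual_model = fit_segment_means(actual_train)
    synthetic_model = fit_segment_means(synthetic_train)
    return (
        mean_absolute_error(actual_model, actual_holdout),
        mean_absolute_error(synthetic_model, actual_holdout),
    )
```

If the synthetic data is a faithful stand-in, the two error values returned by `compare_on_holdout` should be close to each other.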



  • Does continued access to the dataset, including updates, depend on sharing data? 

    Continued access to the dataset, including updates, requires an annual data contribution.