From legacy silos to strategic data: Fixing third-party data fragmentation in asset management
Problem: Asset managers consume data from 30 to 100+ external vendors, leading to schema fragmentation, identifier mismatches, and a costly “reconciliation tax.”
Root cause: Centralized data quality programs fail because they lack the domain context to write meaningful rules, and because their post-ingestion batch runs surface issues too late to prevent downstream impact.
Solution: A distributed validation model that embeds automated quality checks directly at the ingestion source, governed by shared master data standards.
Imagine this scenario: It’s 6:00 AM on the day of a quarterly client reporting deadline. Your portfolio analytics team has just discovered that position data from your custodian, pricing data from Bloomberg, and fund classification data from your internal reference system are telling three different stories about the same holdings. The custodian is reporting a NAV that’s 0.3% higher than your internal calculation. Bloomberg has a stale price on a thinly traded corporate bond that nobody caught because the anomaly fell within your tolerance threshold. And your internal classification still has a CLO tranche bucketed as investment-grade credit, even though it was downgraded six weeks ago.
No one is going to run their model until this is resolved. The reconciliation work falls to two people on the data operations team who know the systems well enough to trace the discrepancy, and they’ll spend most of the morning on it.
Unfortunately, this kind of start to the day is more the rule than the exception. And the cost of it isn’t just hours of work lost, but an opportunity cost: decisions not made, models not run, and reports delayed while talented people chase down discrepancies in the data.
Large asset managers now consume data from 30 to 100+ external vendors, including custodians, prime brokers, fund administrators, market data providers like Bloomberg and Refinitiv, index providers such as MSCI and FTSE, ESG data vendors, and a growing stack of alternative data sources. Each vendor has its own schema, identifier system, update cadence, and its own idea of what constitutes clean data. Stitching these feeds together into something coherent contributes to a buy-side market data spend that Opimas research estimates at $3.1 billion annually, and the number of data sources per business is still growing.
The fragmentation points are expected, but painful. Security identifiers are the clearest example: One vendor uses CUSIPs, another ISINs, another SEDOLs, and your internal system uses proprietary codes with no universal crosswalk. Entity data is even worse. Pricing sources disagree on illiquid instruments, and ESG ratings from MSCI, Sustainalytics, and Refinitiv regularly diverge by enough to change a fund’s classification. Fund taxonomy definitions don’t align across systems.
This is the reconciliation tax. Every new vendor relationship, fund strategy, and reporting requirement adds another line item to it. The problem is architectural, but the way most firms are trying to fix it was never designed to work at this scale.
Why centralized data quality programs stall in asset management
Most large asset managers have already invested in data quality: a team, a platform, and a set of processes, with staffing and budget behind them. But the dominant model (a centralized data quality function, batch validation runs, downstream cleansing before delivery to consumers) was designed for a simpler environment and tends to break down in at least three distinct ways.
1. The vendor multiplicity problem. A centralized team physically cannot maintain meaningful validation rules for 50 different vendor feeds, each with its own schema, quirks, and update patterns. They end up writing generic rules that catch obvious failures but miss the nuanced quality issues that actually affect downstream accuracy. The alternative data vendor that delivers sentiment scores with a 48-hour lag during earnings season doesn’t fail a completeness check, but the data is useless for the strategy it’s feeding.
2. The latency trap. Centralized validation typically happens after data has been ingested, transformed, and often already consumed. A pricing anomaly caught in a morning batch run may have already propagated into a pre-market model execution or an automated report that went out overnight. The feedback loop runs too slow for the business, and by the time you know there’s a problem, it’s too late to prevent downstream impact.
3. The context gap. This is the hardest one to close with better tooling alone. A centralized data quality team lacks the domain knowledge to write rules that are actually meaningful. Whether a 2% pricing discrepancy on a private credit instrument is a data error or an acceptable valuation spread depends on the asset class, the liquidity profile, the use case, and who’s consuming the data. A risk officer cares about different attributes of the same position than a portfolio manager does. Generic thresholds create noise, while business-specific thresholds require domain knowledge that lives with investment teams, risk teams, and operations, not with a centralized data ops function that sits two organizational layers away.
The implication is that more centralized control doesn’t produce better quality in this environment. In fact, it often produces the illusion of quality while real issues slip through.
What asset managers actually need are validation controls embedded closer to where data enters the ecosystem: governed by shared standards, but configured and owned by the people who understand the data’s context.
Embedding quality at the source: A distributed validation model
The architectural shift isn’t complicated to describe, though it takes real effort to implement. Instead of a single quality gate that sits downstream of all ingestion points, you embed validation rules at each entry point where third-party data arrives. Each pipeline carries its own quality checks, calibrated to the specific vendor, the asset class, and the downstream use case. A governance layer then provides the common vocabulary and shared reference data that makes all of those distributed checks coherent rather than contradictory.
Three components make this work:
1. Automated validation at ingestion. Rules fire as data arrives from each vendor, including schema conformance, completeness checks, cross-reference validation against internal master data, and anomaly detection against historical patterns. Issues are flagged and routed before data enters downstream systems, and the feedback loop shrinks from hours to seconds.
This is the pattern that a US asset management firm has operationalized with Ataccama ONE, monitoring 300+ catalog items within their dbt pipelines and validating data against a 250-term business glossary before it reaches downstream consumers. Automated issue routing and lineage tracking mean that when a problem surfaces, it goes to the right person immediately, not into a queue that someone reviews tomorrow morning.
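To make the pattern concrete, here is a minimal sketch of what ingestion-point validation can look like. It is illustrative only, not a depiction of Ataccama ONE’s internals: the field names, thresholds, and the tiny in-memory security master are assumptions made for the example.

```python
# Illustrative ingestion-point validation for a vendor pricing feed (not Ataccama's API).
from dataclasses import dataclass
from datetime import date

# Hypothetical internal security master used for the cross-reference check
SECURITY_MASTER = {"US0378331005", "US912828YK09"}

REQUIRED_FIELDS = {"isin", "price", "price_date", "currency"}

@dataclass
class Issue:
    rule: str
    detail: str

def validate_price_record(record: dict, prior_price: float | None) -> list[Issue]:
    issues: list[Issue] = []

    # 1. Schema conformance / completeness: required fields must be present and non-empty
    missing = REQUIRED_FIELDS - {k for k, v in record.items() if v not in (None, "")}
    if missing:
        issues.append(Issue("completeness", f"missing fields: {sorted(missing)}"))
        return issues  # further checks are meaningless on an incomplete record

    # 2. Cross-reference: the identifier must resolve against internal master data
    if record["isin"] not in SECURITY_MASTER:
        issues.append(Issue("cross_reference", f"unknown ISIN {record['isin']}"))

    # 3. Anomaly detection against history: flag large day-over-day moves and stale dates
    if prior_price:
        move = abs(record["price"] - prior_price) / prior_price
        if move > 0.10:  # illustrative 10% threshold; real tolerances are asset-class specific
            issues.append(Issue("anomaly", f"price moved {move:.1%} vs. prior close"))
    if record["price_date"] < date.today().replace(day=1):
        issues.append(Issue("staleness", f"price dated {record['price_date']}"))

    return issues

# A vendor row with a suspicious jump is flagged before it reaches NAV calculations
row = {"isin": "US0378331005", "price": 232.10, "price_date": date.today(), "currency": "USD"}
for issue in validate_price_record(row, prior_price=180.00):
    print(issue.rule, "->", issue.detail)  # in practice, route to the owning team rather than print
```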
2. Business rule ownership by domain teams. The teams closest to the data define the thresholds that matter for their use cases: portfolio managers set the tolerance bands for pricing discrepancies on the asset classes they manage, risk teams define completeness requirements for position data feeding VaR models, and operations teams own the reconciliation rules for NAV calculations.
The platform enforces the rules, and the business owns the logic. This is a meaningful distinction, because it removes the bottleneck where every new data issue requires an engineering ticket before anyone can do anything about it. With Ataccama ONE, domain experts can define and update business rules directly, without requiring engineering support for every change. It shrinks the gap between “we found a recurring data problem” and “we have a rule that catches it at ingestion” from weeks to days.
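As a rough illustration of that separation, the sketch below keeps the business-owned thresholds in a declarative structure that a domain team could edit directly, while a generic check enforces them. The asset classes and tolerance values are hypothetical; the point of the shape is that changing a tolerance is a configuration change, not a code change.

```python
# Illustrative split between business-owned rules and a generic enforcement engine.
# The asset classes and tolerance values are hypothetical.

# Owned and edited by domain teams; updated without an engineering ticket
PRICING_TOLERANCES = {
    "large_cap_equity": 0.001,  # 0.1%: liquid instruments, tight tolerance
    "corporate_bond": 0.005,    # 0.5%
    "private_credit": 0.020,    # 2.0%: wider valuation spreads are expected
}

def within_tolerance(asset_class: str, vendor_price: float, internal_price: float) -> bool:
    """Enforced by the platform: do two prices agree within the domain-defined tolerance?"""
    tolerance = PRICING_TOLERANCES.get(asset_class, 0.0)  # unknown asset classes get zero tolerance
    discrepancy = abs(vendor_price - internal_price) / internal_price
    return discrepancy <= tolerance

# The same 1.5% gap is an exception for a corporate bond but acceptable for private credit
print(within_tolerance("corporate_bond", 101.5, 100.0))  # False -> raise an exception record
print(within_tolerance("private_credit", 101.5, 100.0))  # True  -> load normally
```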
3. Unified governance as the connective tissue. Distributed validation without shared standards just creates a different kind of chaos. If every team defines “counterparty” differently, or uses a different identifier crosswalk, the distributed rules produce contradictory outputs. The governance layer solves this: A shared business glossary provides common definitions, standardized reference frameworks align how entities and instruments are identified across systems, and lineage visibility lets you trace any data point back to its original source.
This is where Ataccama’s data quality, metadata management, and reference data capabilities do the critical integration work, helping organizations align fragmented identifiers across vendors into a consistent, trusted reference framework that distributed validation rules can rely on. When the Bloomberg ISIN, the custodian CUSIP, and the internal proprietary code all resolve to the same golden record, the identifier mismatch problem that consumes so much reconciliation time stops being a recurring emergency and becomes a solved problem.
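A simplified sketch of that resolution step is below. The crosswalk entries and record shapes are invented for this example; in practice the reference framework lives as governed master data, not a hard-coded dictionary.

```python
# Illustrative identifier resolution against a shared reference framework.
# The crosswalk entries and record shapes are invented for this example.

# Shared reference data: every known external identifier maps to one internal master ID
IDENTIFIER_CROSSWALK = {
    ("ISIN", "US0378331005"): "SEC-000123",
    ("CUSIP", "037833100"): "SEC-000123",
    ("SEDOL", "2046251"): "SEC-000123",
    ("INTERNAL", "AAPL-EQ"): "SEC-000123",
}

GOLDEN_RECORDS = {
    "SEC-000123": {"name": "Apple Inc.", "asset_class": "large_cap_equity"},
}

def resolve(id_type: str, value: str) -> dict | None:
    """Map any vendor identifier to the golden record, or None if it cannot be resolved."""
    master_id = IDENTIFIER_CROSSWALK.get((id_type.upper(), value))
    return GOLDEN_RECORDS.get(master_id) if master_id else None

# The vendor's ISIN, the custodian's CUSIP, and the internal code all land on the same record,
# so positions from different feeds reconcile without manual matching.
print(resolve("ISIN", "US0378331005") == resolve("CUSIP", "037833100"))  # True
print(resolve("SEDOL", "9999999"))  # None -> routed to data stewardship, not silently loaded
```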
The ratio is important: this model is roughly 70% architectural discipline and 30% tooling. The tooling enables the shift, but the real change is organizational. Domain teams need to accept ownership of data quality in their area, and governance needs to provide shared standards without over-centralizing control.
From reconciliation to decision velocity: What changes when you fix the foundation
If the foundation works, the operational math changes in ways that compound over time. Portfolio managers and analysts can stop building shadow spreadsheets: when they trust that the data in the system reflects reality, they don’t need a manual cross-check before they run a screen or size a position. Research cycles also start to compress. According to Coalition Greenwich’s research on data management in asset management, data-driven investment firms are already reallocating analyst time from data preparation to alpha-generating research as their data infrastructure matures.
What’s more, operational risk decreases in measurable ways. Automated validation catches vendor data issues before they reach NAV calculations, client reports, or regulatory filings. Exception-based workflows replace full-population manual reviews. The 6:00 AM scramble before a reporting deadline becomes less common, and eventually, rare.
The AI connection deserves direct treatment, because it’s where the stakes get higher. Asset managers are deploying quantitative models, NLP-based research tools, and increasingly, AI-assisted portfolio construction and risk analysis. Every one of these applications depends on the same upstream data that today feeds the reconciliation problem. A model trained or executed on data that hasn’t been validated against current vendor feeds, resolved across identifier systems, and traced to its source carries hidden risk that most teams don’t have good visibility into. The distributed validation model doesn’t just clean up the past; it creates the conditions for deploying AI with actual confidence in the foundation.
As firms expand into private markets, digital assets, and alternative data, the distributed model scales with them. Each new data source gets its own ingestion-point validation without requiring a redesign of the central architecture.
Where to start: Prioritizing your first moves
No company rewires its data architecture in a single initiative. The question is how to sequence the work so that early investments produce visible results and create the foundation for what comes next.
Start with an audit of your highest-impact data flows. Identify the three to five third-party feeds that generate the most reconciliation effort or carry the highest downstream consequence if they’re wrong. Pricing data feeding NAV calculations, position data feeding risk models, and entity data feeding regulatory reports are usually on the short list. These should be your first targets: fixing them produces results the business can see and builds organizational momentum.
Then, map the current validation gaps for each priority feed. Where do quality checks exist today? Where does manual process compensate for missing automation? This mapping exercise is also a risk inventory; it tells you where you’re one vendor data issue away from a reporting error.
After that, you can pilot embedded validation on a single high-impact pipeline. Implement automated validation at ingestion, measure the reduction in downstream exceptions, and track the time your team stops spending on manual reconciliation for that feed. One working example is more persuasive internally than a framework document.
In parallel, begin unifying your entity and instrument identifiers. The identifier fragmentation problem (CUSIP vs. ISIN vs. SEDOL vs. internal codes, with no crosswalk) is foundational. Every other validation rule you write will reference entity or instrument identity.
Ataccama works with asset managers through exactly this kind of phased implementation, starting with high-impact data quality monitoring and expanding into cataloging, lineage, and broader governance as the foundation matures. The phasing is important because it keeps the initiative grounded in real operational results.
The reconciliation tax is the compounded cost of an architectural mismatch between how third-party data actually flows through asset management companies, and how data quality programs were designed to manage it.
Want to see how Ataccama ONE helps asset management firms move from manual data reconciliation to an automated data trust layer? Speak with a specialist to learn more.
Anja Duricic
Anja is our Product Marketing Manager for ONE AI at Ataccama, with over 5 years in data, including her time at GoodData. She holds an MA from the University of Amsterdam and is passionate about the human experience, learning from real-life companies, and helping them with real-life needs.