EVOLVING DATA SHARING PRACTICES IN CAPITAL MARKETS (PART 3)

Source: Originally published on TabbFORUM

At the start of our series, we highlighted the scale of the bad data problem within capital markets as well as the critical need to rethink the data sharing tools and processes in use today. Here, we’ll compare some of the most common methods of transferring data in financial markets, including examples of where they’re implemented. This exploration will cover the advantages and challenges of commonly used technologies, provide insight into why certain tools are favored over others, and identify where improvements are needed to better serve the industry’s needs.

Key Highlights

While data sharing methods like SFTP, APIs, and SaaS data warehouses each have their specific use cases, there remains a need for continued innovation in data sharing tools to address gaps and meet evolving industry requirements. Below is a summary of the primary protocols in the market today.

Secure File Transfer Protocol (SFTP): The Legacy Approach

In capital markets and beyond, firms constantly need to share large volumes of bulk data with one another. SFTP emerged in the late 1990s and gained prevalence in the early 2000s as the financial industry increasingly prioritized secure data transmission. Globally, the SFTP and data sharing market is estimated at $2-4 billion and continues to grow every year with the rapid rise of machine learning and the increasing need for data analytics in financial services. While this method has the benefit of broad familiarity and is relatively simple to set up, it comes with various challenges for firms, particularly within capital markets.

For starters, the data typically has to undergo multiple hops and transformations – from the source system to a reporting database and through various intermediate servers – before reaching its final destination. Data integrity is often hard to maintain given the ETL processes involved and the potential for corrupted or incomplete files. Engineers who manage SFTP servers have to handle ongoing maintenance for schema changes, password updates, and PGP key updates, which can impact data accuracy and require further reconciliation. Critically, SFTP transfers occur periodically (typically once per day), resulting in latency that prevents real-time data consumption and hinders timely, data-driven decision-making.
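
To make the moving parts concrete, below is a minimal sketch of a nightly batch-file pull in Python using the paramiko library; the hostname, paths, and credentials are illustrative placeholders rather than a real endpoint.

import paramiko

# Hypothetical connection details; in practice these keys and passwords are
# exactly the credentials that require ongoing rotation and maintenance.
HOST = "sftp.example-clearinghouse.com"
USER = "firm_user"
KEY_PATH = "/etc/keys/firm_user_rsa"

transport = paramiko.Transport((HOST, 22))
transport.connect(username=USER,
                  pkey=paramiko.RSAKey.from_private_key_file(KEY_PATH))
sftp = paramiko.SFTPClient.from_transport(transport)

# Pull yesterday's end-of-day report; the file is typically PGP-encrypted and
# must still be decrypted, validated, and reconciled before it can be used.
sftp.get("/outbound/positions_eod.csv.pgp", "/data/inbound/positions_eod.csv.pgp")

sftp.close()
transport.close()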

Another reason firms have gravitated towards this method is that batch files allow for high customization, delivering unique reports to each user. However, this customization requires end users to build custom processes to maintain their reporting structures, and it increases the complexity for the data producer, who must manage each user’s modifications.

Nonetheless, batch files over SFTP remain the de facto protocol (see Exhibit 1) and are commonly used by major clearinghouses and similar market infrastructure companies that manage complex position and risk data.

Exhibit 1

Since batch files typically contain the full set of data and often require some form of customization for certain users, they are usually made available only at the end of the day. This means financial institutions are left without clarity on their outstanding risk until after markets are closed. Intraday positions are based on best estimates until an overnight reconciliation process begins to try to align the books before the next day’s cycle begins anew.

The file types available for download through this method are generally CSV or PDF, which are translated versions of the true database records that exist within the market infrastructure provider’s systems. Every end user has to custom-build a process to ingest that data, creating substantial integration and maintenance costs. It’s not uncommon for a large market participant to receive dozens of daily reports, each with a different cut of its data, for a single asset class from a single market infrastructure provider. All of these translations and custom builds can become rather unwieldy and prone to error.
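
As a rough illustration of the custom builds described above, here is a sketch of the kind of bespoke CSV ingestion each consumer ends up writing; the column names and file layout are hypothetical.

import csv
from decimal import Decimal

def load_positions(path):
    """Parse one daily position report into plain records (hypothetical layout)."""
    positions = []
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            positions.append({
                "account": row["ACCOUNT_ID"],
                "contract": row["CONTRACT"],
                "net_qty": int(row["NET_QTY"]),
                "initial_margin": Decimal(row["INITIAL_MARGIN"]),
            })
    return positions

# Every report with a different cut of the data needs its own mapping like this,
# and every upstream schema change quietly breaks it.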

Application Programming Interface (API): Flexible Yet Limited for Bulk Data

REST APIs were introduced in the early 2000s, and the more modern GraphQL followed in 2012; these have quickly become the standard methods for exchanging data over HTTP, the communication protocol of the internet. As the demand for real-time (or near real-time) data has increased within capital markets, firms have gravitated towards APIs for certain use cases.

These APIs provide common methods to access data and, based on the URL that is called, will run queries against the database (see Exhibit 2).

Exhibit 2

The result of the query is returned in a response message to the caller, typically encoded in JSON, a popular data exchange format understood by most software languages and systems. Exhibit 3 (below) shows an example of a JSON response when the Twitter API is called.

Exhibit 3
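
To show the mechanics, here is a minimal sketch of such an API call in Python using the requests library; the endpoint URL, query parameters, and field names are hypothetical, not a real vendor API.

import requests

# The URL identifies the resource and the query parameters drive the lookup
# the provider runs against its database (all names here are illustrative).
resp = requests.get(
    "https://api.example-infra.com/v1/trades",
    params={"status": "UNMATCHED", "asOf": "2024-06-28"},
    headers={"Authorization": "Bearer <token>"},
    timeout=30,
)
resp.raise_for_status()

# The JSON response can carry nested structures that a flat CSV cannot.
for trade in resp.json()["trades"]:
    print(trade["tradeId"], trade["status"], trade["counterparty"]["name"])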

APIs provide this data in near real-time, which is a major benefit over the batch file approach. Additionally, using a format such as JSON makes APIs more flexible than FIX and allows users to model much more complex data than is possible in the tabular formats (such as CSV) typically used by batch files.

Despite their benefits and prevalence across the internet, APIs are not the primary method for sharing data in capital markets, particularly in post-trade, where batch files dominate. This is partly due to the inherent challenges of building APIs on top of a legacy system, which can be a costly initiative in light of authentication, session management, scaling, load balancing, query optimization, caching, pagination, and rate limiting considerations. The most likely culprit, however, is that data volumes in capital markets can dwarf those of other industries, and APIs were never designed to transfer high volumes of data in bulk.

If data needs to be transferred in high volumes and/or low latency (the time to return data) is required, an API may not be adequate. A key factor in query latency is the distance between the requester and the database; to reduce it, a copy of the data is often created and stored locally or as close as possible to the requester. Additionally, when high volumes of data (gigabytes or more) are requested, a well-designed API paginates the response (breaks it into chunks of a few hundred or a few thousand rows at a time), and even then it may not be able to handle the volume. A growing use case for low-latency queries requiring local data copies is real-time exception management applications.
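
The sketch below illustrates what that pagination looks like from the consumer’s side, against a hypothetical endpoint; parameter names and page sizes vary by API.

import requests

BASE_URL = "https://api.example-infra.com/v1/positions"  # hypothetical endpoint

rows, page = [], 1
while True:
    resp = requests.get(BASE_URL,
                        params={"page": page, "pageSize": 1000},
                        timeout=30)
    resp.raise_for_status()
    batch = resp.json()["results"]
    if not batch:
        break
    rows.extend(batch)
    page += 1

# Pulling millions of rows means thousands of round trips (plus rate limits),
# which is why bulk transfers still tend to fall back to batch files.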

In practice, even when market infrastructure firms invest in building APIs for their end users, they still need to rely on SFTP for bulk data transfers. APIs are typically limited to, and well-suited for, targeted queries such as trade-status checks or database lookups, and they are not suited to the high volumes of data that SFTP delivers.

Data Warehouses: Scalable but Imperfect

In recent years, SaaS solutions like Snowflake and Databricks have revolutionized data management globally. These platforms emerged to address the need for scalable, efficient, and user-friendly data management and sharing solutions. Snowflake, founded in 2012, and Databricks, founded in 2013, have quickly gained traction due to their robust capabilities and ease of use.

One of the primary advantages of using platforms like Snowflake is the ease of setting up data shares once the data is loaded into the system. Users don’t need to be software engineers to write basic SQL statements and create data shares. This democratization of data access allows business users to engage in data analytics without heavy reliance on IT or DevOps teams. Additionally, once data is in Snowflake, there is no need to create additional copies, thereby avoiding increased storage costs. This is particularly beneficial for large datasets, as it reduces redundancy and associated expenses.
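
As an illustration, the sketch below creates a simple data share from Python using Snowflake’s documented CREATE SHARE and GRANT ... TO SHARE syntax; the connection details, database objects, and consumer account are placeholders, and the producer needs a role with the relevant privileges.

import snowflake.connector

# Placeholder connection details; in practice use key-pair auth or SSO and a
# role that is allowed to create shares (e.g., ACCOUNTADMIN).
conn = snowflake.connector.connect(
    account="producer_account",
    user="data_admin",
    password="...",
    role="ACCOUNTADMIN",
)
cur = conn.cursor()

cur.execute("CREATE SHARE IF NOT EXISTS risk_share")
cur.execute("GRANT USAGE ON DATABASE risk_db TO SHARE risk_share")
cur.execute("GRANT USAGE ON SCHEMA risk_db.public TO SHARE risk_share")
cur.execute("GRANT SELECT ON TABLE risk_db.public.eod_positions TO SHARE risk_share")

# Make the share visible to a (placeholder) consumer account; no data is copied.
cur.execute("ALTER SHARE risk_share ADD ACCOUNTS = consumer_account")

conn.close()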

However, there are several challenges associated with SaaS data warehouses. First, data must be loaded into the platform (Snowflake, for example) before any sharing can occur; this initial loading process often involves existing ETL tools, which can be costly and introduce delays. Second, both the data producer and the consumer must be signed up for the same SaaS service (e.g., Snowflake), which can be a limitation for firms with varied infrastructure preferences. Moreover, extracting the data out of the warehouse and internalizing it into a local database requires additional work and pipeline maintenance.

Additionally, these platforms typically provide views into the data rather than moving the data itself. While this approach minimizes storage costs and ensures data consistency, it requires both parties to trust the cloud service provider’s ability to securely host and manage access to their data. Cybersecurity is a paramount concern, as firms must be confident in the security protocols of both the SaaS provider and the underlying cloud infrastructure.

Despite these challenges, SaaS data warehouses are gaining share in capital markets. Their ability to offer real-time analytics and their ease of use make them a compelling option for firms looking to modernize their data sharing processes.

Final Thoughts

As the data-sharing landscape continues to evolve, each method brings its own set of strengths and weaknesses, tailored to meet specific needs within the capital markets. While traditional protocols like SFTP and APIs have served well in various capacities, the advent of SaaS data warehouses like Snowflake and Databricks introduces new efficiencies and challenges. Despite recent advancements, there remain gaps in the overall toolset for data sharing, pointing to the need for continued innovation. Meanwhile, regulators are also paying close attention to sell-side data management practices, evidenced by recent fines against large banks for inadequate data quality and controls.