SQL FILESTREAM - Giving a bit of structure to your unstructured data

When it comes to familiarizing yourself with any new technology there are always two parts. First, you need to understand its intended use case, and second you need to understand how to use/implement this technology in practice (and, for your own good, you should do it strictly in that order). In this article I want to talk about SQL Server FILESTREAM feature making sure that both parts are covered.

Main benefit of any DBMS is an ability to store data in a structured way facilitating information retrieval combined with capabilities to manage your data, ensuring that you have such things as backups and security controls which can be applied on the level of one management entity – database. You have some of these features with modern file system level storage, but DBMS capabilities are richer.

Frequent business requirement you may encounter is requirement to store some large files so that they are linked/associated with all other business information which is stored in your LOB (line of business) database. Depending on business, these files can be anything from scanned originals of legal documents to audio, video, drawings or any other possible proprietary file format which stores data you need. Historically, there was no way of storing these files or, in other terminology, BLOBs (binary large objects) in SQL Server database. Although file system serves well for storing files, it lacks in management capabilities if compared with those provided by DBMS and potentially leaves your business data in two disconnected silos: database and file system. This requirement drove search for ways of joining those two disparate data silos into one or at least connecting them in some way.

SQL Server FILESTREAM is one of the technologies allowing you to do this. But let’s look at what we had before first. Prior to availability of SQL Server FILESTREAM feature possible answers to the problem described above were:

1) Storing data in table storage/VARBINARY(MAX) columns. This approach provides all the benefits of having all your data inside of database, including ability to ensure transactional consistency (meaning that BLOB data stored it the database subject to all database constraints you defined) and simplified management. Downsides of this approach are complicated access to files for external applications and bloat of database size which may lead to overuse of premium storage space or violating database file size limits (for Express Edition).

Conclusion: with this approach your unstructured data is integrated into / stored in database with all storage and data management capabilities, but it cannot be retrieved using conventional file system APIs (your client or intermediate layer application need to take care about this).

2) Storing links to the files locations on the file system in the database. This approach is free from downsides of one we just considered, but it also free from its advantages. Using this approach means that data is still stored separately (good thing you can access it via filesystem APIs) but you are not getting any DBMS data management capabilities available for this data (you rather getting a problem of taking care about keeping in sync your database and file system storage combination, having two objects which you have to manage separately creating double work).

Conclusion: Viable and simple solution which leaves BLOBs outside of DBMS control and only acceptable if you can afford associated management overhead and risks of missing unified controls over your data (e.g. transactional consistency).

Enters FILESTREAM. This SQL Server feature allows varbinary(max) columns to store the data on the filesystem through use of specially designated file groups which contain directories acting as data containers. Thus, we have data stored as files but integrated directly into DBMS so that we can use backup and recovery and integrity protection features both for BLOB files and structured data.

FILESTREAM feature was first introduced in SQL Server 2008 (known under codename Katmai and released in August 2008) and currently available in any edition of any current version of SQL Server. As you can guess from description above FILESTREAM is not a datatype but rather a setting/attribute you configure on varbinary(max) column which tells SQL Server that data must be stored on the file system. This feature integrates SQL Server DBMS with NTFS/ReFS file system. That gives you ability to use Transact-SQL statements to operate on this data (insert, update, query, search, backup) combined with ability to use streaming APIs and performance of file system and all this maintaining transactional consistency both for structured and unstructured data.

FILESTREAM improves DBMS performance as FILESTREAM columns not cached in SQL Server buffer pool, using the NT system cache instead. This saves more SQL Server buffer pool memory for query processing. With FILESTREAM enabled you can use both Transact-SQL and Win32 to manage the FILESTREAM data.

Although data is stored in the form of files (and potentially accessible via conventional filesystem APIs) we are still missing full Windows application capability with FILESTREAM as data cannot be accessed using non-transactional streaming and Windows applications are not aware of database transactions. It also means that there is no full compatibility with the file systems API for file and directory data. As FILESTREAM Win32 streaming support works only in the context of SQL Server transactions we cannot delete or rename FILESTREAM files through the file system.

Conclusion: With FILESTRAM you get transactional consistency and ability to use streaming APIs and performance of the file system. Filesystem APIs/streaming support is limited by transactional mode only. This option is better than solutions which we had before combining benefits of SQL Server management and file system performance.

Any downsides/limitations? Limitation here is that using FILESTREAM implies writing some custom code to allow Windows client applications to access this data. FILESTREAM does not have full Windows API file and directory storage compatibility and does not support non-transactional access.

With FILESTREAM Win32 streaming support works in the context of a SQL Server transaction. Within a transaction, it is possible to use FILESTREAM functions to obtain a logical UNC file system path of a file. Next you can use the OpenSqlFILESTREAM API to obtain a file handle. This handle can then be used by Win32 file streaming interfaces, such as ReadFile() and WriteFile(), to access and update the file by way of the file system.

When you may want to use it? To put it simply use of FILESTREAM can be considered in the following scenarios:

When you need to store files larger than 1 MB (use conventional table storage for smaller files)
When fast read access is important
When you have database file size limits imposed by SQL Server edition limitations which you need to overcome (e.g. Express edition has maximum database size limit of 10 GB, i.e. you have 10 GB to store relational data, but FILESTEAM data is not a subject to this restriction)

To sum it up, if your LOB database needs to include large files you can leverage file stream to optimize performance and have benefits of managing this data as a part of DBMS.

Now with intended use case and limitations of FILESTREAM technology clarified we can move on to practical side of things.

First of all, FILESTREAM needs to be enabled on the SQL Server instance level. It is not enabled by default during installation or upgrade of SQL Server instance, though you can enable it on the Database Engine Configuration step in the SQL Server Setup Wizard:

Normally FILESTREAM configuration performed post installation using SQL Server Configuration Manager and SQL Server Management Studio (SSMS). To check if it is enabled you can use sp_configure system stored procedure:

If your config_value set to 0 it means that FILESTREAM is disabled and you need to set it config_value to 2 to enable it. Alternatively, you can verify configuration state navigating to SQL Server Configuration Manager > SQL Server Services selecting your SQL Server Instance and accessing its properties where you will find FILESTREAM tab:

We can enable FILESTREAM using Transact-SQL/SSMS and SQL Server Configuration Manager.

Let’s enable FILESTREAM using sp_configure stored procedure as shown below:

Executing this command will change your config_value from 0 to 2:

You may be a little bit confused by the message you will receive after changing FILESTREAM_access_level which says: “Configuration option ‘FILESTREAM access level’ changed from 0 to 2. Run the RECONFIGURE statement to install.” If you try to run RECONFIGURE after this, you will see the following message: “FILESTREAM feature could not be initialized. The operating system Administrator must enable FILESTREAM on the instance using Configuration Manager.”

Your next step after changing configuration with sp_configure will be switching over to FILESTREAM tab of instance properties in SQL Server Configuration Manager and ticking Enable FILESTREAM for Transact-SQL access as well as Enable FILESTREAM for file I/O access checkboxes, adjusting Windows share name if necessary (it defaults to MSSQLSERVER in case of default instance or to instance name in case of named instance) optionally you may tick Allow remote clients access to FILESTREAM data:

Click OK in Instance properties and restart your SQL instance to get this setting applied:

Once restart completed you can execute EXEC sp_configure FILESTREAM_access_level command again and see that we have both config_value and run_value set to 2:

Now we can add FILESTREAM enabled file group and file and start using FILESTREAM. To add FILESTREAM file group you can use SSMS GUI (Database Properties > Filegroups):

And T-SQL can be used to do the same as well:

Once you executed commands shown above you will see that FILESTREAM filegroup has been added to your database:

You will also see that folder you specified in SQL script is created with FILESTREAM.hdr file and $FSLOG folder inside:

FILESTREAM.hdr file used to store FILESTREAM header information (metadata describing data container) and $FSLOG directory acts as a FILESTREAM equivalent of the transaction log. Subsequently subdirectories will be created in this folder for each table and inside of those there will be subdirectories foreach FILESTREAM column within the table with GUIDs acting as a file names for all these file system objects.

We can now create table with FILESTREAM column:

Note that for FILESTREAM table you must have column marked as ROWGUIDCOL which allows you to have immutable column separate from the primary key, otherwise on attempt to create table you will receive this error message: “A table that has FILESTREAM columns must have a nonnull unique column with the ROWGUIDCOL property.”

And once table is created we can either save existing file into our FILESTREAM column or convert some text into binary format and save it:

You will see that our binary data was stored in database and GUID was assigned to each record as ID:

You can access these files through file system, but there is no easy way to locate files within GUID based folder structure with files named with GUIDs and having no extension. Still you can locate the file and open it from file system (it is especially easy in the beginning when you have just a few files stored in FILESTREAM container):

The main limitation of this feature is requirement to write custom code to locate and open file. As explained in Microsoft documentation you can use Win32 APIs to create client applications for FILESTREAM data, in particular to obtain Win32 handle to read data from FILESTREAM blob. The OpenSqlFILESTREAM API obtains a Win32 file handle which application can use to stream the FILESTREAM data, and can further pass to the following Win32 APIs: ReadFile, WriteFile, TransmitFile, SetFilePointer, SetEndOfFile, or FlushFileBuffers. Keep in mind that, to access the FILESTREAM BLOB by using Win32, Windows Authorization must be enabled (i.e. you must use Integrated Authentication with FILESTREAM). You can find some relevant code samples in Access FILESTREAM Data with OpenSqlFILESTREAM section of Microsoft documentation.

To conclude this article, I must mention that starting from SQL Server 2012 another option for storing BLOB in the database is available. This option is built on top of FILESTREAM and known as FileTables. FileTables introduces specialized user table with pre-defined schema which stores FILESTREAM data along with directory hierarchy information and file attributes. FileTables address shortcomings of FILESTREAM feature mentioned above, i.e. provides more complete Windows API compatibility and non-transactional access to files. You may read up about selecting between those two options in Compare Options for Storing Blobs (SQL Server) section of Microsoft documentation, and I think more detailed overview of FileTables can be a good topic for one of the next articles.

SQL Server FILESTREAM – Giving a bit of structure to your unstructured data