
My journey from SQL Server to MongoDB

I recently had a chance to introduce MongoDB in one of my projects. This application was a data collection system. Its main purpose was to ingest large amounts of data from different data sources. These sources used different formats: some were XML, some were CSV, and some were binary. They all had one thing in common, though, and that was the large size of the data.

The core of the application was a multi-threaded .NET application that used SqlBulkCopy with SQL Server 2012. SqlBulkCopy is one of the preferred (optimized) ways of loading data into SQL Server. The goal of the application was to be able to parse any given file with its given format definition/configuration file. Therefore, it needed to be able to store data into SQL Server that it had never stored before. This presented challenges, since relational database management systems require you to define your data structure before you can store data (i.e., you must create your tables). There are obviously ways of creating tables on the fly using stored procedures/functions etc., but the application requirements dictated that we could not create tables on the fly. We ended up storing all of our datasets in a single database table, defining its columns with the sql_variant type (our "dataset" table had 256 columns). Needless to say, this was not an elegant solution, since it put a static limit on the number of data points that could be collected. We used Spring.Batch-style streaming to parse these large datasets (files). Once these large datasets started coming in, we started running into issues, especially around SQL Server lock escalation. I have already written a quick article on how we ran into this issue and how we ended up dealing with it in SQL Server.
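
For context, here is a minimal sketch of what a SqlBulkCopy load looks like (the connection string, table, and column names are illustrative, not our actual schema):

```csharp
using System.Data;
using System.Data.SqlClient;

// Build an in-memory batch shaped like the wide "dataset" table.
// Columns typed as object can map to sql_variant columns on the server.
var batch = new DataTable("dataset");
batch.Columns.Add("DatasetId", typeof(int));
batch.Columns.Add("Col001", typeof(object));
batch.Columns.Add("Col002", typeof(object));
batch.Rows.Add(1, 42, "some parsed value");

using (var bulkCopy = new SqlBulkCopy("Server=.;Database=Collector;Integrated Security=true"))
{
    bulkCopy.DestinationTableName = "dataset";
    bulkCopy.BatchSize = 5000; // a tuning knob; large unbatched loads can aggravate lock escalation
    bulkCopy.WriteToServer(batch); // streams rows via SQL Server's bulk-load protocol
}
```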

I started recognizing that the problem we had was a good match for MongoDB, or document-based NoSQL systems in general, since they are schema-less and do not require you to define your data structures before your data is inserted. At the time, I was enrolled in the M101J and M102 classes from https://university.mongodb.com/. These classes provided a lot of detail about what MongoDB was, what it was not, and what it could do for your Big Data challenges.

I am not going to go into detail on what MongoDB is in this article. You can find a lot of information about it on their web site.

I put together a very quick prototype for management and the rest of the team showing how MongoDB would be able to help us eliminate some of the issues we ran into. Before implementing any solution with MongoDB (or anything else for that matter), I think it is important to understand its limitations. MongoDB, as I mentioned earlier, is a document-based database. This document structure can contain basic properties, arrays, and other complex MongoDB documents. A document is represented as JSON. At the time of this writing, MongoDB documents cannot be larger than 16 MB. This can obviously be a challenge. There are also limitations on the size of a collection. A collection in MongoDB is analogous to a table in SQL Server. MongoDB modeling supports relationships between collections. It does this either by embedding a document in another document or by linking to another document. Once I was familiar with the limitations and MongoDB's modeling approaches, I designed my data model.
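
To make the two modeling approaches concrete, here is a minimal sketch using the C# driver's BsonDocument type (all field names and values below are made up for illustration):

```csharp
using System;
using MongoDB.Bson;

// Embedding: the related data travels inside the dataset document itself.
var embedded = new BsonDocument
{
    { "source", "sensor-feed-01" },                  // hypothetical data source name
    { "importedAt", DateTime.UtcNow },
    { "points", new BsonArray { 12.5, 13.1, 12.9 } } // embedded array of data points
};

// Linking: the dataset document stores only a reference (an ObjectId)
// to a document kept in a separate collection.
var pointsDocumentId = ObjectId.GenerateNewId();
var linked = new BsonDocument
{
    { "source", "sensor-feed-01" },
    { "pointsRef", pointsDocumentId }                // resolved with a second query
};
```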

The parsing framework I built had the concept of writers (much like Spring.Batch). In our arsenal, we had a composite writer whose main job was to chain other writers. I ended up writing a MongoDBItemWriter and configured our parsers to use this writer. This writer's logic was very simple: it first created a collection for the type of data (or data source) if one did not already exist, and then used the MongoDB C# driver API to save the data into that collection.
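
A minimal sketch of that logic, using the 1.x C# driver API that was current at the time (the class shape and method names below are illustrative, not our production code):

```csharp
using System.Collections.Generic;
using MongoDB.Bson;
using MongoDB.Driver;

public class MongoDBItemWriter
{
    private readonly MongoDatabase _database;

    public MongoDBItemWriter(MongoDatabase database)
    {
        _database = database;
    }

    public void Write(string dataSource, IEnumerable<BsonDocument> items)
    {
        // Create the collection on first use; MongoDB needs no upfront schema.
        if (!_database.CollectionExists(dataSource))
        {
            _database.CreateCollection(dataSource);
        }

        var collection = _database.GetCollection<BsonDocument>(dataSource);
        foreach (var item in items)
        {
            collection.Save(item); // insert, or update if the _id already exists
        }
    }
}
```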

Just like SqlBulkCopy, MongoDB has a way to efficiently save data in a bulk fashion. The MongoCollection object in the C# driver has a method called InsertBatch that receives an IEnumerable. We built on this prototype and are almost ready to completely turn off our SqlBulkCopyItemWriter. We have been running tests against it and are happy with the results we are seeing with regard to performance, memory usage, etc. We are still using SQL Server in conjunction with MongoDB.
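
A short sketch of what an InsertBatch call looks like with the 1.x driver (the connection string, database, and collection names are hypothetical):

```csharp
using System.Collections.Generic;
using MongoDB.Bson;
using MongoDB.Driver;

var client = new MongoClient("mongodb://localhost");        // connection string is illustrative
var database = client.GetServer().GetDatabase("collector"); // hypothetical database name
var collection = database.GetCollection<BsonDocument>("datasets");

// Accumulate parsed rows and hand them to InsertBatch in one call,
// the rough MongoDB counterpart to SqlBulkCopy.WriteToServer.
var batch = new List<BsonDocument>
{
    new BsonDocument { { "source", "feed-a" }, { "value", 1 } },
    new BsonDocument { { "source", "feed-a" }, { "value", 2 } }
};
collection.InsertBatch(batch); // one round of inserts for the whole batch
```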

I think it is important to note that solutions like MongoDB are out there and can definitely help you build robust solutions to your large, complex problems.

