Skip to content

Sql optimization trick in F#

I’m doing a project in F# where we do Business Intelligence on lots of data that lives in Sql Server. The general form is to iterate over a subset of patients, and for each patient, read the entire timeline of events to find points of interest. As you can imagine, with a database of 1.5M patients and over 100M timeline events, doing a database read for every patient gets prohibitively expensive. One run can take about 30 hours, and most of that time is spent roundtripping to Sql Server.

Pseudo code:


let patients = read "SELECT * FROM Patients"
for p in patients do
   let timeline = read ("SELECT * FROM Treatment WHERE SSN = " + p.SSN)
   ... do something with p + timeline

The optimization trick is this: We can replace all those extra reads with a single read. The data would not fit in memory, so to keep memory usage low, it needs to be done in a forward-only manner.
The trick is to order the data of both queries by the same key, SSN in this case. SQL Server is pretty good at this sort of thing if you provide the correct indexes. If we know that data is ordered, we could keep the timeline reader around and advance it just as much as needed for each patient.

I won’t go into the tedium of reading from Sql here.
We can test without bothering to read from the database:
 


// Simple 'patient' data, just an id here, but could be anything
let patients = [1;2;3;4]

// Simple 'timeline' data, contains the id from patients and some additional data
let timeline = [1,"event for first patient"
                1,"some more"
                1,"even more"
                3,"event for third patient"]

We want to be able to iterate patients and find all timeline rows for each patient in a forward-only manner.
For this, we create a function that takes a patient and returns the timeline rows.
 

// Create the function that can be used to get detail rows, given a master row.
// Type of 'key must be comparable
let getRowsFn (detail:'b seq) 
              (masterkey: 'a -> 'key) 
              (detailkey: 'b -> 'key) = 
    let enumer = detail.GetEnumerator()
    let hasRow = ref (enumer.MoveNext())
    (fun (master:'a) -> 
        [
           let mkey = masterkey(master)
           // Skip ahead for as long as detailkey is less than what we're looking for
           while !hasRow && detailkey(enumer.Current) < mkey do
              hasRow := enumer.MoveNext()
           // For as long as the keys match, return current detail row and proceed to next detail row
           while !hasRow && detailkey(enumer.Current) = mkey do
              yield enumer.Current
              hasRow := enumer.MoveNext() ]), 
     { new IDisposable with 
         member this.Dispose() = enumer.Dispose() }


// Usage
// pass in the timeline data, 
// 'id' because the patient is the key,
// and 'fst' because the key in timeline is the tuples first element
let getTimeline, disp = getRowsFn timeline id fst

for patient in patients do
   let timeline = getTimeline patient
   printfn "%d: %A" patient timeline

// Output:
// 1: [(1, "event for first patient"); (1, "some more"); (1, "even more")]
// 2: []
// 3: [(3, "event for third patient")]
// 4: []

Note that getRowsFn is completely generic and does not know anything about patients or timelines. getRowsFn takes three arguments:

  • detail – The detail seq, timeline in this case
  • masterkey – A function that takes a patient and returns the key to use for comparison
  • detailkey – A function that takes a timeline and returns the key to use for comparison
  •  
    It returns two things:

  • function (‘a -> ‘b list) – the function to use for getting detail rows.
  • IDisposable – The function needs to know when it can get rid of the enumerator. Call Dispose() on this object when you are done.
  •  
    The sample code uses lists for test data, but runs equally well on data from Sql Server if you expose your data as sequences.
    With this in place my sample run went from 30+ hours to half an hour and still hardly consumes any memory.
     
    Let’s be clear: This is an optimization. Code readability suffers and stuff gets tangled up. If you don’t need the performance, don’t do it. You could argue that this type of data processing is a prime candidate for a document database, and you’d be right. Still, with source data in Sql Server, you’d need to do something like this in order to create that document database.
     
    Another drawback with this approach is if you have for example three hierarchical levels instead of two. You can compose this, but note that your third level of data must be sorted accordingly. Say, if you have the following hierarchy:

  • Customer – key CustomerId
  • Orders – key OrderId
  • OrderDetail – key OrderDetailId
  •  
    If Customers are sorted by CustomerId,
    Orders would need to be sorted by CustomerId / OrderId,
    and OrderDetail would need to be sorted by CustomerId / OrderId / OrderDetailId.

    Advertisements

    F# on SqlClr

    I thought I’d start this blog off with a post on how I got my F# code to run in SqlClr. Why would you want to do this? If you’re dealing with huge amounts of data and need to process that data in ways that are not suited to Sql, it might help performance-wise if your code can be ‘closer’ to the data. Do read more about it: Introduction to SQL Server CLR Integration

    I was starting this new project that requires lots of data analysis and I really wanted to use F# instead of C# because of the many benefits it brings.

    Let’s begin. Create a database to house your stored procedure. In order to run Clr code on it, run the following script against the database. We will need to deploy FSharp.Core which is not marked as trustworthy by Sql Server. That is why we run the ‘set trustworthy on’ bits.

    
    -- Enable Sql Clr
    sp_configure 'clr enabled', 1
    reconfigure
    go
    
    -- Avoid having to sign assemblies and worry about other security issues
    -- Dont forget to change 'DatabaseName' below.
    alter database DatabaseName set trustworthy on
    go
    
    

    Create a new F# Library. Make sure you target .Net 2.0 as this is what current versions of Sql Server run. The upcoming Sql Server ‘Denali’ will run .Net 4.0.

    Now add the code for a simple stored procedure:

    namespace SqlClr
    open System
    open System.Data
    open System.Data.Sql
    open System.Data.SqlTypes
    open System.Data.SqlClient
    open Microsoft.SqlServer.Server
    
    type SqlClrSample =
       [<SqlProcedure()>]
       static member Hello (x:string) =
          SqlContext.Pipe.Send("Hello " + x)
    

    This code doesn’t require much explanation. We create a static method that has an SqlProcedureAttribute applied to it. SqlContext.Pipe.Send sends a message directly to the client. There are other interesting methods in SqlContext.Pipe, but I’ll leave that for later.

    Build the assembly. Now we must register the assembly in Sql Server in order to run this code. We also need to deploy any referenced assemblies. In our case, there is one, FSharp.Core.
    When you do SqlClr projects in C#, Visual Studio will deploy the assembly for you. In F#, we will need to do this ourselves. Make a Deploy.sql file in your project. I like to keep this file in the project even though it doesn’t get built, so that it is versioned along with the rest of the project. Sql script for registering follows.

    -- First, drop everything in backwards order
    IF  EXISTS (SELECT * FROM sys.procedures procs WHERE procs.name = N'Hello')
        DROP PROCEDURE Hello
    GO
    
    IF  EXISTS (SELECT * FROM sys.assemblies asms WHERE asms.name = N'SqlClr')
        DROP ASSEMBLY [SqlClr]
    GO
    IF  EXISTS (SELECT * FROM sys.assemblies asms WHERE asms.name = N'FSCore')
        DROP ASSEMBLY [FSCore]
    GO
    
    -- Now, create everything in forwards order
    CREATE ASSEMBLY FSCore FROM 'C:\Program Files (x86)\Reference Assemblies\Microsoft\FSharp\2.0\Runtime\v2.0\FSharp.Core.dll' WITH PERMISSION_SET = UNSAFE
    GO
    CREATE ASSEMBLY SqlClr FROM 'D:\robert\dokument\visual studio 2010\Projects\SqlClr\SqlClr\bin\Debug\SqlClr.dll'
    GO
    
    -- External name is [SqlAssemblyName].[Full typename].[Method name]
    CREATE PROCEDURE [dbo].[Hello]
       @name [nvarchar](64)
    AS
    EXTERNAL NAME [SqlClr].[SqlClr.SqlClrSample].[Hello]
    GO
    
    

    Some things of note:

    • We need to manage dependencies ourselves. This means that in order to be able to repeat this when we make changes, we need to drop everything we created the last time in backwards order. Procedure ‘Hello’ can not exist without assembly ‘SqlClr’, which in turn can not exist without assembly ‘FSCore’. If this is done in the wrong order the script will simply fail. Once everything is dropped, we create everything in forwards order.
    • There are hard-coded paths in this script. I haven’t really found a good workaround for this yet. Suggestions are welcome!

    Run this script against your database. Now you should be able to run your procedure:

    Now, there are more useful forms of Sql Clr code and I’ll hopefully have more to say on the subject at a later date. Sample project is here.

    Update: Niels Berglund informs me that we don’t have to use SqlProcedureAttribute from F# as it is only used for deployment from VS. I’ll still keep them in my project for clarity.