
Sql optimization trick in F#

2012-12-20

I’m doing a project in F# where we do Business Intelligence on lots of data that lives in Sql Server. The general form is to iterate over a subset of patients, and for each patient, read the entire timeline of events to find points of interest. As you can imagine, with a database of 1.5M patients and over 100M timeline events, doing a database read for every patient gets prohibitively expensive. One run can take about 30 hours, and most of that time is spent roundtripping to Sql Server.

Pseudo code:


let patients = read "SELECT * FROM Patients"
for p in patients do
   // One database round trip per patient: the classic "select N+1" problem
   let timeline = read ("SELECT * FROM Treatment WHERE SSN = " + p.SSN)
   ... do something with p + timeline

The optimization trick is this: we can replace all those extra reads with a single read. The data would not fit in memory, so to keep memory usage low, it needs to be done in a forward-only manner.
The idea is to order the data of both queries by the same key, SSN in this case. SQL Server is pretty good at this sort of thing if you provide the correct indexes. If we know the data is ordered, we can keep the timeline reader around and advance it just as far as needed for each patient.
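
For illustration, the two ordered queries could look something like this (the exact index definition is an assumption about the schema):

// A sketch, not the production queries: both result sets are ordered by the
// same key, backed by an index such as
//   CREATE INDEX IX_Treatment_SSN ON Treatment (SSN)
let patientQuery  = "SELECT * FROM Patients ORDER BY SSN"
let timelineQuery = "SELECT * FROM Treatment ORDER BY SSN"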

I won’t go into the tedium of reading from Sql here. We can test the idea without bothering to read from the database:

// Simple 'patient' data, just an id here, but could be anything
let patients = [1;2;3;4]

// Simple 'timeline' data, contains the id from patients and some additional data
let timeline = [1,"event for first patient"
                1,"some more"
                1,"even more"
                3,"event for third patient"]

We want to iterate the patients and, for each patient, find all of its timeline rows in a forward-only manner. For this, we create a function that takes a patient and returns the matching timeline rows.

open System

// Create the function that can be used to get detail rows, given a master row.
// The type of 'key must be comparable, and the detail seq must be sorted
// ascending by detailkey (just as the master rows must come sorted by masterkey).
let getRowsFn (detail:'b seq) 
              (masterkey: 'a -> 'key) 
              (detailkey: 'b -> 'key) = 
    let enumer = detail.GetEnumerator()
    let hasRow = ref (enumer.MoveNext())
    (fun (master:'a) -> 
        [
           let mkey = masterkey(master)
           // Skip ahead for as long as detailkey is less than what we're looking for
           while !hasRow && detailkey(enumer.Current) < mkey do
              hasRow := enumer.MoveNext()
           // For as long as the keys match, return current detail row and proceed to next detail row
           while !hasRow && detailkey(enumer.Current) = mkey do
              yield enumer.Current
              hasRow := enumer.MoveNext() ]), 
     { new IDisposable with 
         member this.Dispose() = enumer.Dispose() }


// Usage
// pass in the timeline data, 
// 'id' because the patient is the key,
// and 'fst' because the key in timeline is the tuple's first element
let getTimeline, disp = getRowsFn timeline id fst

for patient in patients do
   let timeline = getTimeline patient
   printfn "%d: %A" patient timeline

// Clean up the underlying enumerator when we're done
disp.Dispose()

// Output:
// 1: [(1, "event for first patient"); (1, "some more"); (1, "even more")]
// 2: []
// 3: [(3, "event for third patient")]
// 4: []

Note that getRowsFn is completely generic and knows nothing about patients or timelines. It takes three arguments:

  • detail – the detail seq, timeline in this case
  • masterkey – a function that takes a master row (a patient) and returns the key to use for comparison
  • detailkey – a function that takes a detail row (a timeline entry) and returns the key to use for comparison

It returns two things:

  • A function ('a -> 'b list) – the function to use for getting detail rows.
  • An IDisposable – the function needs to know when it can get rid of the enumerator, so call Dispose() on this object when you are done.
The sample code uses lists for test data, but it runs equally well on data from Sql Server if you expose your data as sequences, along the lines of the sketch below.
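
A minimal sketch of such a wrapper, assuming a table shaped like the Treatment example (the connection string, column names and the readTimeline name are made up for illustration):

open System.Data.SqlClient

// Hypothetical helper: stream rows lazily from Sql Server as a sequence.
// The seq { } wrapper keeps this forward-only instead of loading everything,
// and the ORDER BY is what makes getRowsFn work.
let readTimeline (connectionString: string) =
    seq {
        use conn = new SqlConnection(connectionString)
        conn.Open()
        use cmd = new SqlCommand("SELECT SSN, Event FROM Treatment ORDER BY SSN", conn)
        use reader = cmd.ExecuteReader()
        while reader.Read() do
            yield reader.GetString(0), reader.GetString(1)
    }

The resulting sequence can be handed to getRowsFn in place of the list.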
With this in place, my sample run went from 30+ hours to half an hour, and it still hardly consumes any memory.

Let’s be clear: this is an optimization. Code readability suffers and things get tangled up. If you don’t need the performance, don’t do it. You could argue that this type of data processing is a prime candidate for a document database, and you’d be right. Still, with the source data in Sql Server, you’d need to do something like this in order to create that document database.

Another drawback of this approach shows up when you have, say, three hierarchical levels instead of two. You can compose this (see the sketch after the list below), but note that each level of data must be sorted accordingly. Say you have the following hierarchy:

  • Customer – key CustomerId
  • Orders – key OrderId
  • OrderDetail – key OrderDetailId

If Customers are sorted by CustomerId, Orders would need to be sorted by CustomerId / OrderId, and OrderDetail would need to be sorted by CustomerId / OrderId / OrderDetailId.
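
A hypothetical sketch of that composition, reusing getRowsFn from above. The record types and sample data are made up for illustration; the point is that the deeper level is matched on a composite key:

// Made-up types; only getRowsFn is reused from the code above.
type Customer    = { CustomerId: int }
type Order       = { CustomerId: int; OrderId: int }
type OrderDetail = { CustomerId: int; OrderId: int; OrderDetailId: int }

// Sample data, sorted exactly as described above.
let customers : Customer list =
    [ { CustomerId = 1 }; { CustomerId = 2 } ]
let orders : Order list =
    [ { CustomerId = 1; OrderId = 1 }
      { CustomerId = 1; OrderId = 2 } ]
let details : OrderDetail list =
    [ { CustomerId = 1; OrderId = 1; OrderDetailId = 1 }
      { CustomerId = 1; OrderId = 2; OrderDetailId = 1 } ]

// Orders are matched on CustomerId alone; details on the composite
// (CustomerId, OrderId) key, which is why each level must be sorted
// by the full key path down to it.
let getOrders, dispOrders =
    getRowsFn orders (fun (c: Customer) -> c.CustomerId)
                     (fun (o: Order) -> o.CustomerId)
let getDetails, dispDetails =
    getRowsFn details (fun (o: Order) -> o.CustomerId, o.OrderId)
                      (fun (d: OrderDetail) -> d.CustomerId, d.OrderId)

for c in customers do
    for o in getOrders c do
        printfn "%A -> %A" o (getDetails o)

dispOrders.Dispose()
dispDetails.Dispose()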


Comments

1. Stuart:

      What you have described is the “select N+1 problem” http://stackoverflow.com/questions/97197/what-is-the-n1-selects-problem
      http://www.hibernatingrhinos.com/products/nhprof/learn/alert/SelectNPlusOne

      Have you tried just using a SQL “join” (in conjunction with indexing and ordering on the SSN key)?

      • rojepp (author): Stuart, I did a join first, but stopped for two reasons.
        The amount of unnecessary data transferred in my case would be huge. There are often 100 events for each patient, and there are fifteen or so fields; that’s 1500 fields transferred for no reason.
        It also made the code messier, because in my case I still needed to do something ‘per patient’, so I would have had to sort that out client side if I used a join.

        • Stuart: rojepp, I see what you are saying; your pseudocode “do something with p + timeline” hides the specifics of your particular problem. Hand-crafting some SQL to pull back only the fields you were interested in would solve the unnecessary-data problem, but then again that involves writing SQL 😉 I like your solution and will keep it in mind if I see a similar problem in the future.

        • rojepp (author): Stuart, you’ve got me wrong; I have nothing against writing Sql. In fact, in most cases I prefer it. But there’s no way to craft a join that will remove the duplication of data from the Patient table, as it were. Add to that the complexity of keeping track of what’s what on the client side. I could flesh out more details of the problem, I agree. 🙂

    Trackbacks & Pingbacks

    1. F# Weekly #51, 2012 « Sergey Tihon's Blog
