
Solving Generic Data Processing Problems – using SQL, R and Power BI

Hmm, how could this happen – my last post is more than 12 months old … During this period some things have changed (mostly for the better, it seems), and some things are still the same: I'm still in love with data.

My resolution for 2017: post more often 🙂

Generic Data Processing Problems

Over the last years I have encountered similar problems and had to solve them with different tools; for this reason I started to call these kinds of problems generic problems. This way of thinking has made me much more versatile in developing problem-solving solutions, and also faster (at least it works for me).

My toolset has not changed in the sense that I'm now using this tool instead of that one; it has changed in the sense that I'm now using more tools. My current weapons of choice are SQL (T-SQL to be precise), R, and Power BI.

One problem group is labeled “Subset and Apply”.

Subset and Apply

Subset and Apply means that I have a dataset whose rows, due to some conditions, have to be put into buckets, and then a function has to be applied to each bucket.

The simple problem can be solved by a GROUP BY in T-SQL; the not-so-simple problem requires that all columns and rows of the dataset be retained for further processing, even if these columns are not used to subset or bucket the rows in your dataset (see the little data.table contrast right below).
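To make the distinction concrete, here is a tiny R sketch using the data.table package (explained later in this post); the table and column names are made up purely for illustration:

library(data.table);

dt <- data.table(customerid = c(1, 1, 2), revenue = c(10, 20, 30));

# the simple problem: a GROUP BY collapses each bucket to one row,
# non-grouping columns are lost
dt[, .(revenue = sum(revenue)), by = customerid];

# the not-so-simple problem: every row and column is retained,
# the per-bucket result is just attached as a new column
dt[, bucketrevenue := sum(revenue), by = customerid];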

Subset and Apply – Indexing Rows

What I want to achieve is to create a new column that shows the row index of each row within its subset.

The SQL script, the R script, and the Power BI file can be found here:
https://www.dropbox.com/sh/y180d8q0rcjclxg/AADaHttgsAGBZg4vH55hMMYka?dl=0

I start with this simple dataset

[Figure: the base dataset]

And this is how the result should look:

[Figure: the desired result]

I like the idea of ensemble modeling or decomposition; for this reason I came up with the following three separate parts that needed a solution:

  • Create a new column in my dataset
  • Build subsets / groups / windows in my dataset
  • Apply a function to each of the subsets (I may have to consider some kind of sorting within each group)

Here are some areas of interest where you may encounter this type of problem:

  • Feature Engineering, create features that represent a sequence for your statistical models
  • Create content for a slicer in Power BI, that enables you to compare the 1st order across all customers
  • Ensure that points have the correct order if you are tasked with creating complex shapes

Subset and Apply – Indexing Rows – T-SQL

In T-SQL I'm going to use the OVER() clause and the ROW_NUMBER() function.

The complete T-SQL statement:

SELECT
    basetable.*,
    ROW_NUMBER() OVER(PARTITION BY basetable.customerid
                      ORDER BY basetable.orderdate, basetable.orderid) AS rowindexinsubset
FROM
    @basetable AS basetable;

Because I started my data career with T-SQL this solution seems to be the most obvious.

The subsetting of the dataset is done by PARTITION BY, the ordering by the ORDER BY part of the OVER(…) clause, and the function that is applied to each subset is ROW_NUMBER().

Subset and Apply – Indexing Rows – R (data.table package)

Please be aware that there are many R packages (estimates exist that the number of packages available on CRAN will soon reach ten thousand) that can help to solve this task, but for a couple of reasons my go-to package for almost any data munging task is "data.table". The #1 reason: performance (most of the time the datasets in question are somewhat larger than this simple one).

Basically the essential R code looks like this:

setkeyv(dt, cols= c("orderdate", "orderid"));
dt[, rowindexinsubset := seq_along(.I), by = list(customerid)];
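If you want to run these two lines end to end, here is a minimal self-contained sketch; the sample data below is made up, only the two lines above are the essential part:

library(data.table);

# made-up sample data: three customers with a few orders each
dt <- data.table(
    customerid = c(1, 1, 1, 2, 2, 3),
    orderid    = c(11, 12, 13, 21, 22, 31),
    orderdate  = as.Date(c("2016-01-03", "2016-01-01", "2016-02-01",
                           "2016-01-05", "2016-01-02", "2016-01-04"))
);

# order the complete dataset (and thereby the rows within each subset)
setkeyv(dt, cols = c("orderdate", "orderid"));

# apply the indexing function to each customer subset
dt[, rowindexinsubset := seq_along(.I), by = list(customerid)];
print(dt);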

As I already mentioned, most of the time I'm using the data.table package, and there are some concise documents available
(https://cran.r-project.org/web/packages/data.table/index.html).
Please keep in mind that this post is not about explaining this package but about solving problems using tools. Nevertheless, here are two short remarks for those of you who are not familiar with this package:

  • A data.table object is of class data.frame and data.table; this means whenever a function expects a data.frame you can also use a data.table object
  • A data.table has its own intricate workings in comparison to a data.frame; for this reason I urgently recommend reading the above-mentioned documents if you are looking for a great new friend.

Basically, a data.table operation is performed using one or more of the three segments within a data.table reference:

dt[ … , … , … ]

The first segment defines the row-filter dt[ rowfilter , … , … ]

The second segment defines the column-operations dt[ … , column-operations , …]

The third segment is a special segment where different keywords specify different things; e.g., the keyword by is used to subset the dataset.

Coming back to the decomposition of the problem, this means that:

  • by = list(customerid) performs the subsetting
  • seq_along(.I) is the function that is applied to each subset of the dataset
  • rowindexinsubset is the name of a new column that gets its values for each row from the rhs of the assignment expression in the column-operations segment

If the first line is omitted, we will see that the "sorting" part that orders the rows within each subset is obviously missing …

[Figure: the result in R without keys]

Ordering a data.table, or to be precise, ordering the rows within a subset, is not nearly as obvious as using OVER(PARTITION BY … ORDER BY …), but in return it is fast (the kind of fast that is Iron Fist fast, even if we are talking about hundreds of millions of rows).

Using setkeyv(dt, cols = c(…)) orders the complete dataset by creating a sorted key, and if we think about it, we will come to the conclusion that this also determines the order of the rows within each subset. Please have a closer look at the documents if you are interested in keys in data.table objects (and be assured, you should be).

seq_along(.I), where seq_along(…) is a base R function that corresponds in this simple usage to the ROW_NUMBER() function of T-SQL, whereas .I is a data.table-specific symbol holding the row locations of the current subset; this means each row of the subset is exposed to the function.
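As a short side note: within a grouped operation data.table also exposes .N, the number of rows in the current subset, so the same new column could be created like this (an equivalent variant, not what the script above uses):

dt[, rowindexinsubset := seq_len(.N), by = list(customerid)];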

Using both lines mentioned above returns the expected result – voilà

[Figure: the result in R with keys]

Subset and Apply – Indexing Rows – Power BI

The same output can be achieved quite easily in Power BI. If you are somewhat familiar with the Get Data component (I'm trying to avoid the term Power Query whenever I'm talking about Power BI), it is easy, but not that obvious.

Starting with the same dataset …

[Figure: the base dataset in Power BI]

Coming back to the decomposition of the problem, this means that:

  • The subsetting can be performed quite easily, using the "Group By" transform:
    [Figure: the Group By configuration]
  • The application of a function to each row in a subset is not that obvious, but if you are a little familiar with reading the query script that is automagically created for each action, you will discover this:
    [Figure: the underscore in the generated query script]
  • Yes, you are right, I'm talking about the underscore. Replacing the underscore with two combined functions, walking the powerful Excel way (you may also call it functional programming), and I'm already done:
    Table.AddIndexColumn( Table.Sort(_, {{"orderdate", 0}, {"orderid" , 0}} ) , "IndexInGroup" ,1,1 )

    the keyword each applies the stacked functions to the calling object. The calling object in this case is a subset (the group created by the Table.Group function from the original dataset) that still contains all rows. The inner function Table.Sort(…) orders the rows of the group (the 0 stands for ascending order), and the outer function Table.AddIndexColumn (the ROW_NUMBER() function of M) then numbers them.

The Power BI file also contains a query that uses R to create the sample data and to create the result table; this query is called "completeR".

Data Visualization – Using R ggplot2 and Microsoft Reporting Services

This post is the 3rd post in a series on how to use the R package ggplot2 to create data visualizations, but before delving into R code here comes a little confession.

For a couple of years (decades) I have been an avid user of the SQL Server data platform, spanning of course the relational database engine, designing and implementing DWHs, but also building analytical applications using MS SQL Server Analysis Services Multidimensional and Tabular. But for a couple of reasons I never got along with Reporting Services as a platform for Business Intelligence solutions on top of the data stores. With the upcoming release of SQL Server 2016 and Microsoft's BI strategy (disclosed at the PASS Summit 2015), this will change dramatically.

With the upcoming release of SQL Server 2016, Reporting Services will become the foundation for on-premises BI architectures; after the initial release it will also become possible to publish interactive Power BI Desktop (disclaimer: I'm addicted) reports and dashboards to Reporting Services. In the current release of SQL Server 2016 there is already a new web service available (no SharePoint required) that hosts the reports, and there is already the possibility to design dashboards for mobile devices. The technology of DataZen (acquired by Microsoft some time ago) is integrated into Reporting Services.

The above, in combination with the integration of the statistical framework R into SQL Server 2016, makes this post not just the 3rd in a series on how to create data visualizations with R (using ggplot2), but also the first post that describes the benefits of the integration of R into SQL Server: in this special case, how to use R visualizations within SQL Server Reporting Services.

Because Microsoft is moving at an unbelievable pace, this post will become somewhat lengthy. It will of course describe some R scripting to create another data visualization, but it will also be the beginning of a couple of posts that describe how, and much more importantly why, one should use R in combination with SQL Server.

Please be aware that whenever I mention SQL Server or Reporting Services, I'm referring to the currently available pre-release of SQL Server 2016, known as SQL Server 2016 CTP 3.2.

You can find all files in this Dropbox folder https://www.dropbox.com/sh/wsrgpsb6b6jfl6a/AABwsnNFN_djU8i5KUEEHWp7a?dl=0

There is an R script "hichertexample without reporting services.R" that you can use without Reporting Services, just from within your favorite R IDE. Just be aware that the R script tries to access an ODBC source called "rdevelop". To create the table that is used by the R script, execute the SQL script from "HichertExample.sql".

To avoid any confusion: do not use any script, file, or anything else I provide in a production environment.

First things first, the data visualization!

This time my custom R visualization was inspired by an older chart by Hichert (the inventor of the International Business Communication Standards – IBCS: http://www.hichert.com/). The Excel file that I used for this example can be found here: http://www.hichert.com/excel/excel-templates/templates-2012.html#044A.

We are (I’m) using bar charts to compare qualitative variables, like the chart below:

[Figure: a bar chart to compare qualitative variables]

Actual Values are compared with Budget Values, sometimes Actual Values are missing. In the chart above, Actual Values are missing for the months October to December.

Depending on the use case for the visualization, it is also somewhat difficult to identify whether the Budget Values are above or below the Actual Values, not to mention whether there is a trend in the achievement of the Budget Values.

For this reason I like the chart type below: it easily communicates the gap between the Budget Values and the Actual Values, and also visually indicates that Forecast Values were used for the months October to December by using a dotted line for the x-axis.

[Figure: Actual/Budget comparison inspired by IBCS]

The chart above shows the Budget Values (grey column) and Actual Values (black line) and the difference between the Budget Values and Actual Values / Forecast Values (months October to December) as a colored rectangle. A red rectangle indicates lower Actual Values and a green rectangle higher Actual Values.

The Budget Value is labeled at the root of the columns, whereas the Actual Values are labeled above or below the Actual Value indicator (the black line).

Now, it’s time to delve into some R code.

There is some data preparation necessary; here is a little screenshot that depicts the structure of the data:

[Figure: the source data]

I call this kind of structure "long format", in contrast to "wide format". If one wants to visualize data using ggplot2, it's always a good idea to have the data in the long format. To give you an idea whether your data is in the long format or the wide format, just use this simple (truly oversimplified) rule: long format means "more rows", whereas wide format means "more columns". If you are familiar with the SQL operators PIVOT and UNPIVOT, you will know what I'm talking about (see the little melt sketch below).
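If your data arrives in the wide format, the data.table function melt (the UNPIVOT of R, so to speak) brings it into the long format; a tiny sketch with made-up values, reusing the column names of this post:

library(data.table);

# made-up wide table: one column per measure
dt.wide <- data.table(Monthname = c("Jan", "Feb"),
                      Budget = c(100, 110),
                      Actual = c(95, 120));

# unpivot the measure columns into DataType/Revenue pairs
dt.long <- melt(dt.wide, id.vars = "Monthname",
                variable.name = "DataType", value.name = "Revenue");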

By the way – this sample data can be created using the file "HichertExample.sql". I would not recommend executing this SQL statement in a production environment, even though I took great care not to DROP any objects that were not created within the same statement.

Please be aware that I will explain somewhat later in this post how to pass datasets and other variables to an R script that is executed within SQL Server (well, almost within).

But for this special case I want to have my data in the wide format, because I will create some additional information depending on the values of the column "DataType". I call the process of transforming a long dataset into a wide dataset pivoting (transforming rows into columns).

For this reason I use the following line in my R script:

dt.cast <- dcast(dt.source, Comp + Monthname ~ DataType, value.var = "Revenue", fun = sum, fill = NA);

This function uses the data.table object “dt.source” and creates a new object “dt.cast”. After applying this function to the data.table “dt.source”, the new data.table will look like this:

[Figure: the data after pivoting]

Since version 1.9.6 the R package data.table provides the function "dcast" to pivot columns; for this reason it is no longer necessary to use the package reshape2.

For a couple of reasons I always use the package "data.table" for all kinds of data preparation, mainly because of performance and to not lose my focus. I'm aware that there are other packages available that can also be used, but I will use "data.table" until there is a package that does all this in a fraction of the time (happy waiting!).

As one can see, there are no Actual Values for the months Oct, Nov, and Dec. For this reason I will create two additional columns that will be used during charting:

dt.cast[, UseForecastValue := ifelse(is.na(Actual), 1,0)];
dt.cast[, ComparisonValue := ifelse(is.na(Actual), Forecast,Actual)];

The first line creates the column UseForecastValue and assigns the value 1 if the Actual Value is missing. The second line creates the column ComparisonValue and assigns the value from the Actual column if a value is present, and from the Forecast column otherwise.

The next lines ensure that the data.table dt.cast is properly sorted, using an index number assigned to each Monthname:

dt.monthordered = data.table(
    Monthname = c("Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec"),
    Monthindex = c(1,2,3,4,5,6,7,8,9,10,11,12));
setkey(dt.monthordered, Monthname);
setkey(dt.cast, Monthname);
dt.cast <- merge(dt.cast, dt.monthordered, by = "Monthname");
setorder(dt.cast, Monthindex);

The first line creates the data.table "dt.monthordered". The next two lines create a key on the Monthname column of both data.tables.

Using merge in the next line combines both data.tables, and finally the data.table "dt.cast" gets ordered by the column Monthindex.
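Just as a side note: base R ships the constant month.abb ("Jan" … "Dec"), so the little lookup table could arguably be replaced by a one-liner. The script in the Dropbox folder uses the merge shown above; this is only an alternative sketch:

# month.abb is a built-in vector of the abbreviated month names
dt.cast[, Monthindex := match(Monthname, month.abb)];
setorder(dt.cast, Monthindex);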

The rectangles depicting the difference between the Actual/Forecast Value and the Budget Value are drawn using geom_rect(…); therefore it is necessary to determine the height of each rectangle, which is done by the next line:

dt.cast[, diffCompareBudget := ComparisonValue - Budget][] # diff ComparisonValue - Budget

This simply creates a new column, "diffCompareBudget".

The next lines create a numerical index for the values shown on the x-axis; this index is stored in the column "category.level.ordinal". This value is used later on to determine the coordinates along the x-axis for the rectangles:

dt.cast$category <- factor(dt.cast$Monthname, levels = unique(dt.cast$Monthname));
category.levels <- levels(dt.cast$category);
names(category.levels) <- c(levels(dt.cast$category));
dt.cast$category.level.ordinal <- as.numeric(match(dt.cast$category, category.levels));
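Since the factor levels were created in order of appearance, the same ordinal could, as far as I can tell, also be derived directly from the factor codes (just an equivalent shortcut, not what the script uses):

dt.cast$category.level.ordinal <- as.integer(dt.cast$category);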

Then there are some lines creating boolean values to tweak the chart. This is one of the great possibilities provided by the ggplot2 package: build the chart in layers that can be added or omitted, just by encapsulating these layers in a simple if-clause:

show.BudgetValues <- TRUE;
show.BudgetLabels <- TRUE;
show.ActualLabels <- TRUE;
show.diffValues <- TRUE;
show.diffLabels <- TRUE;
show.actLine <- TRUE;
show.zeroline <- TRUE;

And of course then the drawing of the chart begins 🙂

Basically this is straightforward …

First a ggplot2 object is initialized using

p <- ggplot();

Each layer of the chart is contained within an if-clause that adds the layer to the chart, or not 😉

if (…) {
    p <- p + geom_…(…);
};
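The actual layers are in the R script in the Dropbox folder; purely to illustrate the pattern (the aesthetics below are my guess at one such layer, not the original code), a toggled geom_rect layer could look like this:

# add the difference rectangles only if the corresponding toggle is set
if (show.diffValues) {
    p <- p + geom_rect(data = dt.cast,
                       aes(xmin = category.level.ordinal - 0.4,
                           xmax = category.level.ordinal + 0.4,
                           ymin = pmin(Budget, ComparisonValue),
                           ymax = pmax(Budget, ComparisonValue),
                           fill = diffCompareBudget > 0));
};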

In preparation for the usage of this chart in a SQL Server Reporting Services report, the final chart is rendered as a binary object using these lines:

image_file = tempfile();
png(filename = image_file, width = 900, height = 700, res = 144);
print(p);
dev.off();
OutputDataSet <- data.frame(data = readBin(file(image_file, "rb"), what = raw(), n = 1e6));

I guess the last line can be a little confusing, so here is some additional explanation:

The stored procedure that executes the R script, "sp_execute_external_script", returns a result object. This object has to have the name "OutputDataSet". The content of the temporarily created file "image_file" is assigned to this result object; because Reporting Services expects a binary, readBin(…) is used.
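One small remark on n = 1e6: readBin reads at most n bytes, so a very large image could in theory be truncated. Reading exactly the file size avoids that guess (just a suggested variant of the line above):

# read exactly as many bytes as the temporary png file contains
OutputDataSet <- data.frame(
    data = readBin(image_file, what = raw(), n = file.info(image_file)$size));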

And now – some words about the R integration in SQL Server 2016 and the usage of custom Data Visualizations in SQL Server 2016 Reporting Services.

If you want to try this at home 🙂 you have to use this version of SQL Server 2016: SQL Server 2016 CTP 3.2 (found here: https://technet.microsoft.com/de-de/evalcenter/mt130694.aspx).

During the installation of SQL Server, make sure you check the new "Advanced Analytics Extensions" option and also execute the post-installation process that is described here (https://technet.microsoft.com/en-us/library/mt590808.aspx).

Because custom data visualizations in SQL Server Reporting Services are image files (I like that), this part is very simple: provide Reporting Services with something that is a binary. I already explained how to create a binary (in my example, a png file) from within the R script.

Check!

To leverage the power of R from within SQL Server, basically the new system stored procedure "sp_execute_external_script" is executed from within another stored procedure. In my example, the stored procedure "dbo.getHichertExample_plot" calls this new system procedure.

The dataset "getHichertExample" of the report "HichertExample" queries its data using a stored procedure and passes the selected company as a parameter to the stored procedure, like this:

exec getHichertExample_plot @comp ;

This stored procedure assembles the SQL statement that provides the input data for the R script and executes the new system stored procedure "sp_execute_external_script". This procedure and its parameters are described here (https://msdn.microsoft.com/en-us/library/mt604368.aspx). Its usage is quite simple, assuming you have some T-SQL experience.

To make all this work smoothly it is necessary to consider some security issues, meaning the privileges:

  • the user needs at least some kind of data access to the source table meaning for example being a member of the db_datareader role
  • the user has to be a member of the new SQL Server database role “db_rrerole”
  • the user needs a grant to execute the procedure “getHichertExample”
  • the user needs the privilege to execute external scripts

The SQL file "the security stuff.sql" contains the statements that are necessary to provide the proper access rights to make my example work; I guess this could come in handy in the future as well.

And after publishing the report to your report server, the report will look like this in the new web portal of Reporting Services 2016:

[Figure: the chart in Reporting Services]

This simplicity is astonishing: it combines the power of two outstanding platforms in a way that will at least provide topics for my future posts 🙂
Thanks for reading!

They walk in line – SQL Server 2016 Reporting Services and R charting using ggplot2

Some weeks ago I started blogging with the first part of a series about the not-so-obvious aspects of charting using the well-known (not to say famous) R package ggplot2 (developed by Hadley Wickham).

This post is not part of that series; it exists simply due to my enthusiasm for the integration of R into SQL Server 2016 and the possibilities that come with this integration.

If you want to try it yourself, you can find the preview version of SQL Server 2016 here:

https://www.microsoft.com/en-us/evalcenter/evaluate-sql-server-2016

Please be aware that this is the CTP (Community Technology Preview) 3.0; for this reason you should not use this release in a production environment, and also not on a machine that is used for the development work that pays your rent. If you want to use R from T-SQL (meaning as an external script 🙂), please make sure that you select the feature "Advanced Analytics" in the feature selection list during installation.

There are also some sample files available:

https://www.microsoft.com/en-us/download/details.aspx?id=49502

The zip archive "SQLServer2016CTP3Samples" contains the document "Getting Started.docx" in the folder "Advanced Analytics". This document explains how to install the additional components that are necessary to get the R integration up and running (a pretty straightforward explanation).

The above mentioned components can be found here:

https://msdn.microsoft.com/en-US/library/mt604883.aspx

My first experiment using the R integration of SQL Server 2016 CTP 3.0 was inspired by one of the older IBCS (International Business Communication Standards) templates from 2012, which can be found here:

http://www.hichert.com/excel/excel-templates/templates-2012.html#044A

The result of my first experiment:

[Figure: the resulting chart]

I hope that by the end of the week I will have finished the 2nd part of the ggplot2 series, and also the 3rd part, which explains how to create the chart above using R charting and SQL Server 2016 Reporting Services.

Keep on charting, it’s Rsome 🙂