
Solving Generic Data Processing Problems – using SQL, R and Power BI

Hmm, how did this happen, my last post is more than 12 months old … During this period some things have changed – mostly for the better, it seems – and some things are still the same: I’m still in love with data.

My resolution for the year 2017: post more often 🙂

Generic Data Processing Problems

Over the last years I have encountered similar problems and had to solve them with different tools; for this reason I started to call these kinds of problems generic problems. This way of thinking has made me much more versatile in developing problem-solving solutions, and also faster (at least this thinking works for me).

My toolset has not changed in the sense that I’m now using this tool instead of that tool, but in the sense that I’m now using more tools. My current weapons of choice are SQL (T-SQL to be precise), R, and Power BI.

One problem group is labeled “Subset and Apply”.

Subset and Apply

Subset and Apply means that I have a dataset whose rows, based on some conditions, have to be put into buckets, and then a function has to be applied to each bucket.

The simple variant of the problem can be solved by a GROUP BY in T-SQL; the not-so-simple variant requires that all columns and rows of the dataset be retained for further processing, even if these columns are not used to subset or bucket the rows in your dataset.

Subset and Apply – Indexing Rows

What I want to achieve is to create a new column that shows the row index of each row within its subset.

The SQL script, the R script, and the Power BI file can be found here:
https://www.dropbox.com/sh/y180d8q0rcjclxg/AADaHttgsAGBZg4vH55hMMYka?dl=0

I start with this simple dataset:

[image: the base dataset]

And this is how the result should look:

[image: the result dataset]
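If you want to follow along without the linked files, a tiny stand-in dataset in R could look like this (just a sketch: the column names match the scripts below, the values are made up):

library(data.table)

# made-up sample data: three customers with a few orders each
dt <- data.table(
  customerid = c(1, 1, 1, 2, 2, 3),
  orderid    = c(10, 11, 12, 20, 21, 30),
  orderdate  = as.Date(c("2016-01-03", "2016-02-14", "2016-01-03",
                         "2016-01-05", "2016-03-01", "2016-02-10"))
)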

I like the idea of ensemble modeling, or decomposition; for this reason I came up with the following three separate parts that needed a solution:

  • Create a new column in my dataset
  • Build subsets / groups / windows in my dataset
  • Apply a function to each of the subsets (I may have to consider some kind of sorting within each group)

Here are some areas of interest where you may encounter this type of problem:

  • Feature Engineering, create features that represent a sequence for your statistical models
  • Create content for a slicer in Power BI, that enables you to compare the 1st order across all customers
  • Ensure that points have the correct order if you are tasked with creating complex shapes

Subset and Apply – Indexing Rows – T-SQL

In T-SQL I’m going to use the OVER() clause and the ROW_NUMBER() function.

The complete T-SQL statement:

SELECT
    basetable.*,
    ROW_NUMBER() OVER (
        PARTITION BY basetable.customerid
        ORDER BY basetable.orderdate, basetable.orderid
    ) AS rowindexinsubset
FROM @basetable AS basetable;

Because I started my data career with T-SQL, this solution seems the most obvious to me.

The subsetting of the dataset is done by PARTITION BY, the ordering by the ORDER BY part of the OVER(…) clause; the function that is applied to each subset is ROW_NUMBER().

Subset and Apply – Indexing Rows – R (data.table package)

Please be aware that there are many packages for R that can help to solve this task (estimates exist that the number of packages available on CRAN will soon reach ten thousand), but for a couple of reasons my go-to package for almost any data munging task is “data.table”. The #1 reason: performance (most of the time the datasets in question are somewhat larger than this simple one).

Basically the essential R code looks like this:

library(data.table);

# order the complete dataset by creating a sorted key
setkeyv(dt, cols = c("orderdate", "orderid"));
# number the rows within each customer subset
dt[, rowindexinsubset := seq_along(.I), by = list(customerid)];
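With the made-up sample from above, these two lines would produce this result (a sketch of the output):

print(dt);
##    customerid orderid  orderdate rowindexinsubset
## 1:          1      10 2016-01-03                1
## 2:          1      12 2016-01-03                2
## 3:          2      20 2016-01-05                1
## 4:          3      30 2016-02-10                1
## 5:          1      11 2016-02-14                3
## 6:          2      21 2016-03-01                2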

As I already mentioned, most of the time I’m using the data.table package, and there are some concise documents available (https://cran.r-project.org/web/packages/data.table/index.html). Please keep in mind that this post is not about explaining this package but about solving problems using tools. Nevertheless, here are two short remarks for those of you who are not familiar with this package:

  • A data.table object is of the classes data.frame and data.table; this means whenever a function expects a data.frame you can also use a data.table object
  • A data.table has its own intricate workings in comparison to a data.frame; for this reason I urgently recommend reading the above-mentioned documents if you are looking for a great new friend.

Basically, a data.table operation is performed using one or more of the segments within a data.table reference:

dt[ … , … , … ]

The first segment defines the row-filter dt[ rowfilter , … , … ]

The second segment defines the column-operations dt[ … , column-operations , …]

The third segment is a special segment where different keywords specify different things, e.g. the keyword by is used to subset the dataset.
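To make the three segments a little more tangible, here is a tiny sketch using the sample columns from above (the filter date is made up):

dt[orderdate >= as.Date("2016-02-01"),   # first segment: row filter
   .(orders = .N),                       # second segment: column operation
   by = list(customerid)];               # third segment: subsetting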

Coming back to the decomposition of the problem, this means that

  • by = list(customerid) performs the subsetting
  • seq_along(.I) is the function that is applied to each subset of the dataset
  • rowindexinsubset is the name of a new column that gets its values for each row from the rhs of the assignment expression in the column-operations segment

If the first line (the setkeyv(…)) is omitted, we will see that obviously the “sorting” part is missing that orders the rows in each subset …

[image: the result in R, without keys]

Ordering a data.table, or to be precise, ordering the rows in a subset, is not nearly as obvious as using OVER(PARTITION BY … ORDER BY …), but in return it is fast (the kind of fast that is fast as Iron Fist, even if we are talking about hundreds of millions of rows).

Using setkeyv(dt, cols = c(…)) orders the complete dataset by creating a sorted key, and if we think about it, we will come to the conclusion that this also determines the order of rows in each subset, because grouping with by preserves the row order within each group. Please have a closer look at the documents if you are more interested in keys in data.table objects (and be assured, you should be).

seq_along(.I): seq_along(…) is a base R function and corresponds in this simple usage to the ROW_NUMBER() function of T-SQL, whereas .I is a data.table-specific symbol that holds the row numbers of the current subset; this means each row of the subset is exposed to the function.
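As a little aside, the same numbering can be expressed in a couple of equivalent ways (a sketch; seq_len(.N) is the more common idiom, and rowid() requires a reasonably recent data.table version):

dt[, rowindexinsubset := seq_along(.I), by = list(customerid)];  # as above
dt[, rowindexinsubset := seq_len(.N), by = list(customerid)];    # same result
dt[, rowindexinsubset := rowid(customerid)];                     # data.table helper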

Using the two lines shown at the beginning of this section returns the expected result – voilà

[image: the result in R, with keys]

Subset and Apply – Indexing Rows – Power BI

The same output can be achieved in Power BI as well; if you are somewhat familiar with the Get Data component (I’m trying to avoid the term Power Query whenever I’m talking about Power BI) it is quite easy, but not that obvious.

Starting with the same dataset …

[image: the base dataset in Power BI]

Coming back to the decomposition of the problem this means that

  • The subsetting can be performed quite easily, using the “Group By” transform
    [image: the Group By configuration]
  • The application of a function to each row in a subset is not that obvious, but if you are a little familiar with reading the query script that is automagically created for each action, you will discover this:
    [image: the underscore in the generated query script]
  • Yes, you are right, I’m talking about the underscore. Replacing the underscore with two combined functions, walking the powerful Excel way (you may also call it functional programming), and I’m already done:
    Table.AddIndexColumn(Table.Sort(_, {{"orderdate", Order.Ascending}, {"orderid", Order.Ascending}}), "IndexInGroup", 1, 1)

    The keyword each applies the stacked functions to the calling object. The calling object in this case is a subset (the group created by the Table.Group function) of the original dataset that still contains all rows. The inner function Table.Sort(…) orders the rows of the group first, and the outer function Table.AddIndexColumn (the ROW_NUMBER() function of M) then adds the index.

The Power BI file also contains a query that uses R to create the sample data and to create the result table; this query is called “completeR”.

Data Visualization – Using ggplot2 to create custom visuals in Power BI

I have to admit that I couldn’t resist “recycling” my example from https://minceddata.wordpress.com/2015/12/23/data-visualization-using-r-ggplot2-and-microsoft-reporting-services/ to create (almost) the same data visualization in combination with Power BI.

[image: Screen 1]

You can find the Power BI file (“Using ggplot2 with Power BI example.pbix”) here: https://www.dropbox.com/sh/uxlv7p1m3q6ol8f/AADh3XHUzZ8w7XQPHqV7MI5ua?dl=0

 

A little note:

I’m using R for Windows Version 3.2.3 (the “Wooden Christmas-Tree” version) and the latest Power BI Desktop Version that you can find here: https://www.microsoft.com/en-us/download/details.aspx?id=45331.

To enable the R visualization preview feature, follow these steps: http://blogs.msdn.com/b/powerbi/archive/2015/12/17/announcing-preview-of-r-visuals-in-power-bi-desktop.aspx

 

Please keep in mind that the usage of R-based visualizations within Power BI is a preview feature; reports that contain R visualizations cannot be published to the Power BI service. Currently this is a Power BI Desktop feature alone. I would not recommend using this in a production environment.

As I already mentioned, the R code is almost the same as in the example for Reporting Services. For this reason I will concentrate on the special stuff that leverages the power of Power BI (sounds a little funny).

One thing that makes data analysis with Power BI a real pleasure is the interactivity between the data and the objects used for data visualization.

For this reason I added some lines to the initial R script:

# read the number of companies (determined by the selection of the Comp slicer)
slicer <- unique(dt.source$Comp)
slicer.count <- length(slicer)

Here I determine the selected members (or all the members, if nothing is selected) and also count them (from the slicer in the Power BI report). This becomes necessary to calculate the width and the proper coordinates of the rectangles.


# creating a vector that contains shades of grey
cols.vector <- NULL
cols.vector <- colorRampPalette(c("grey80", "grey50"))(slicer.count)
names(cols.vector) <- slicer;

With the lines above I create a vector that contains colors ranging from “grey80” to “grey50”. This vector will be used later on in the function scale_fill_manual(…). The vector cols.vector is named using the content of the variable slicer. A named vector is one of the fundamentals of R, so this concept should be understood for more advanced R programming. Maybe this little analogy will help: I imagine a named vector as some kind of key/value pair, where the key is the name and the value is the value. For the example data this leads to something like this:

key / value
"Company A" / "#CCCCCC"
"Company B" / "#7F7F7F"
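In code, the same key/value behavior looks like this (a little sketch, detached from the report script):

# a named vector acts like a key/value lookup
cols.vector <- colorRampPalette(c("grey80", "grey50"))(2)
names(cols.vector) <- c("Company A", "Company B")
cols.vector["Company B"]   # returns "#7F7F7F"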

Looking at the first geom_rect(…), the part … aes(…, fill = Comp, …) … maps the content of the column Comp of the dataset to the fill aesthetic. In combination with the vector cols.vector and the function scale_fill_manual(…) we get this colorful chart 😉
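Here is a minimal, self-contained sketch of that fill mapping (made-up data, and a simple bar chart instead of the post’s geom_rect, just to show the principle; it reuses cols.vector from the sketch above):

library(ggplot2)

# map the Comp column to the fill aesthetic and color it via the named vector
df <- data.frame(Comp = c("Company A", "Company B"), Sales = c(3, 5))
ggplot(df) +
  geom_bar(aes(x = Comp, y = Sales, fill = Comp), stat = "identity") +
  scale_fill_manual(values = cols.vector)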

# determine the levels of the companies shown on the x-axis, to get a numerical
# value that is used later to draw the rectangles; for each category (Monthname)
# n-slicer columns are drawn
dt.cast$slicer <- factor(dt.cast$Comp, levels = unique(dt.cast$Comp));
slicer.levels <- levels(dt.cast$slicer);
names(slicer.levels) <- c(levels(dt.cast$slicer));
dt.cast$slicer.level.ordinal <- as.numeric(match(dt.cast$slicer, slicer.levels));

 

The lines above add new fields to the dataset; these fields are used to determine the proper coordinates of the rectangles drawn with geom_rect(…). geom_rect(…) provides a little more flexibility than geom_bar(…), especially if you want to interfere with the width of a bar.

In contrast to the example for Reporting Services, I use the number of selected members from the Comp slicer to hide or show some details of the chart; this means if more than one member is selected, I hide some chart details.

if (slicer.count > 1) {
    …
    chart.columnwidth.default <- 0.9
} else {
    …
    chart.columnwidth.default <- 0.6
}

The line below (segment.width <- …) creates the anchor value for the width of a bar, starting at 0.9, meaning there is a little gap between different categories (the months). The unit of this measure is relative; you should be aware that a value greater than 1 will lead to overlapping bars (even if one can consider data as beautiful, not necessarily art will evolve from this).

# the default width of a bar
# chart.columnwidth.default <- 0.9
segment.width <- chart.columnwidth.default / slicer.count

This value, in combination with the number of selected members of the slicer, leads to the proper coordinates for the rectangles.

geom_rect(
    …
    , aes(
        xmin = …
        , xmax = …
        , …
    )
)
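The full coordinate calculation lives in the linked .pbix; as a rough sketch of the idea (my own reconstruction, with a hypothetical month.ordinal for the numeric position of the category on the x-axis), the coordinates could be derived like this:

# hypothetical sketch: place the n-th company's rectangle within its month
segment.width <- chart.columnwidth.default / slicer.count
dt.cast$xmin <- dt.cast$month.ordinal - chart.columnwidth.default / 2 +
                (dt.cast$slicer.level.ordinal - 1) * segment.width
dt.cast$xmax <- dt.cast$xmin + segment.width

With two companies selected, for example, each rectangle becomes 0.9 / 2 = 0.45 units wide.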

C’est ça!

I have to admit that I’m totally awed by the possibilities of the R integration, not just into SQL Server 2016 but also into Power BI! I guess some of my future posts will be more on the “analytics” side of R, instead of the data visualization part 🙂

Thanks for reading!