Author Archive

Using “cursors” in PDW

Hilmar Buchta

PDW v1/v2

Did I say cursor? Isn’t this an evil word? Shouldn’t we try as hard as possible to avoid them in database design and especially in a data warehouse?

Yes, sure. But there might be some patterns which make it useful to loop over a table, for example a configuration table, and do something with each line of the table. Since PDW currently doesn’t support cursors (why should it?), what can we do?

One option is to use the foreach-container in SSIS. It’s a good, reliable and easy way to implement loops.

However, if you need to, you can also do this using SQL. The following example shows a loop over all partitions of the FactSalesHeader table:

create table #FactSalesHeaderPartitions
WITH ( DISTRIBUTION = REPLICATE )
AS
SELECT sp.partition_number, prv.value AS boundary_value, lower(sty.name) AS boundary_value_type, sp.rows,
row_number() over (order by sp.partition_number) AS RowNr
FROM sys.tables st JOIN sys.indexes si ON st.object_id = si.object_id AND si.index_id <2
join sys.partitions sp ON sp.object_id = st.object_id AND sp.index_id = si.index_id
join sys.partition_schemes ps ON ps.data_space_id = si.data_space_id
join sys.partition_range_values prv ON prv.function_id = ps.function_id
join sys.partition_parameters pp ON pp.function_id = ps.function_id
join sys.types sty ON sty.user_type_id = pp.user_type_id AND prv.boundary_id = sp.partition_number
where st.object_id = (select object_id from sys.objects WHERE name = ‘FactSalesHeader’)

declare @part_count int = (select count(*) from #FactSalesHeaderPartitions)

declare @i int = 1

while @i <= @part_count begin

    declare @partition int
    declare @boundary int
   
    select @partition=partition_numnber, @boundary=boundary_value 
      from #FactSalesHeaderPartitions where RowNr=@i

   
    — do something with the partition, for example take it out, merge (ctas)
    — it with new data and then switch the resulting partition into the
    — fact table.
    
    set @i+=1
end

drop table #FactSalesHeaderPartitions

The idea is to put all the rows that you want to create a loop for in a temporary table, including a column with a row number. You can then create a simple while-loop starting with 1 and going up the the number of rows in the temporary table. Since the row number is also included in the temporary table, you can then select the correct values into your variables within the while loop.

Please note that the above SQL code for reading partitions uses PDW v1 AU 3.5 system tables. For other versions the table names may be different.

  • Share/Bookmark

Reminder for PASS SQLSaturday #230 (Germany)

Hilmar Buchta

Don’t miss the SQLSaturday on July 13, 2013. You can find more information about the location, registration and the agenda here:

http://www.sqlsaturday.com/230/

Also take a look at the agenda of the Pre-Conference (July 12, 2013) with interesting speakers and presentations: Here is the link for the details and the registration:

http://sqlsaturday230.eventbrite.de

  • Share/Bookmark

Connecting to PDW from PowerPivot

Hilmar Buchta

PDW v1/v2

When connecting to a PDW from PowerPivot, there might be some confusion about what to enter as the server name.

Confusion might even start at the very beginning when choosing the proper external data connection. After opening PowerPivot, you might assume to find the PDW connection behind the “From Database” ribbon icon. But you have to choose “From Other Sources” instead:

image

In the following dialog, choose “Microsoft SQL Server Parallel Data Warehouse” as the database type:

image

Next, a dialog opens to enter the connection details.

While you can choose any name you like for the connection name, the server name may be confusing as you usually enter a host name or IP address here. But for the PDW you have to enter the path to an IDS file containing the connection information:

image

How does such in IDS file look like? Let’s take a look at an example:

[Provider]
ProviderName=Microsoft SQL Server MPP OLE DB Provider
clsid={7D5C1E01-747C-4f39-8BEF-A88133706917}
[DSNInfo]
Description=MyPDWConnection
[Properties]
Host=192.168.27.23
Port=17001
Database=AdventureWorksDW
UseLDAP=0
DistinguishedName=
Encrypted=0
LoadBalancing=0
ConnectionRetryCount=0
ConnectionRetryDelay=3
AlternateServers=
DriverCompatibility=0
LogonID=mylogin
Password=mypassword
StatementFailover=0

Be sure to enter the correct IP address and port (default is 17001 for the TDS and 17000 for the Sequelink, but TDS is much faster), as well as the correct user/password.

  • Share/Bookmark

Distributed or replicated table? And what is important when choosing the distribution key?

Hilmar Buchta

PDW v1/v2

For large tables usually we’re looking for tuning options like creating an index or having a good partition strategy in place. For the Parallel Data Warehouse (PDW) additional decisions have to be made for the table layout.

Distributed or replicated?

The first decision is about the way the table is stored on the compute nodes. There are two options:

  1. Replicated
    All data of the table is available on all compute nodes
  2. Distributed
    The data of the table is distributed between the compute nodes

Distributing table data between the compute nodes follows the real nature of the MPP system. In PDW this distribution is established using a hash function on a specific table column which is referred to as the distribution key. In the other hand, replicated tables have their full content available on every compute node.

Creating a table in one of the two modes is quite easy:

Distributed table Replicated table

CREATE TABLE MyTable(
    ID int NOT NULL,
    …
)
WITH (DISTRIBUTION = HASH(ID))

CREATE TABLE MyTable(
    ID int NOT NULL,
    …
)
WITH (DISTRIBUTION = REPLICATE)

But when do we choose a distributed or replicated table?

As a rule of thumb, you will want to create tables which contain reference data, or – as we say in the data warehouse environment – dimensions, as replicated tables if they are not too big. The reason is simple. The typical data warehouse query will be a star join between the fact and the dimension tables with where-conditions on columns of the dimension tables, grouping (group by) on columns of the dimension tables and aggregations based on columns of the fact tables. So, if we distribute the large fact tables in order to leverage the full power of the MPP engine, having the dimensions on each compute node allows the compute node to answer the query with out needing data from other compute nodes. Let’s take a look at the following query based on a customer dimension which is linked to a sales table:

select Sum(SalesAmount) from FactSales
inner join DimCustomer On FactSales.CustomerKey=DimCustomer.CustomerKey
where DimCustomer.Region=’EMEA’

No matter on which key the FactSales table is distributed, having the DimCustomer table replicated means, that each compute node can individually compute the Sum of sales for the customers in the EMEA region. There still has to be a final aggregation for the results coming from each compute node, but in this case, this is just one line per compute node.

Also consider, that read/write is much faster with distributed tables (parallel process) compared to replicated table. This is one reason why replicated tables should be used for a smaller amount of data.

 

Choosing a good distribution key

The following aspects are important when choosing a distribution key:

  • What kind of workload do we have?
    (do we usually see lots of “atomic” reads, returning only a few rows, or do we more likely expect large scans and aggregates on the table)
  • What are the typically performed joins among the tables
  • What are the typically performed aggregations (group by) used on the tables?
  • How is the distribution key itself distributed?
    Choosing a distribution key which is unequally distributed will result in skew. The different distributions of the table should contain almost the same number of rows in order to have a good parallelization of queries over the compute nodes. If all the data sits on one compute node because of a bad distribution, this node becomes the bottleneck and you cannot expect a good performance.

I’m getting back to the challenge of finding a good distribution key in later posts.

Please keep in mind, that for both decisions, distributed vs. replicated and the distribution key, you don’t have to make a decision that lasts forever. In fact, it’s quite unlikely that you come up with the best solution at the very beginning. It’s quite easy to redistribute a table based on another distribution key or to turn a distributed table into a replicated one. For both scenarios, CTAS can be used. CTAS stands for Create Table As Select. This is quite similar to the Select into syntax on the SMP SQL Server. For example, if you want to change the distribution key of a table Sales you could follow these steps:

  • CTAS Sales to SalesNEW having the new distribution key
  • rename Sales to SalesBAK
  • rename SalesNEW to Sales
  • drop SalesBAK
  • Share/Bookmark

Big Data and Analytics

Hilmar Buchta

PDW v2 | Big Data

In my former post about Big Data, I used a “definition” which can be abbreviated as

“data that is too big for analysis within the required time”

The key aspects of this phrase are:

  1. size of data
  2. time frame for the analysis
  3. complexity of analysis

 

The time frame can be real time, near time, a few hours or maybe even days. This depends on the business requirement. The size of data may get bigger than expected because you need additional data sources (for example external data from market places) for your analysis. But today I’d like to focus on the third bullet point: the complexity of analysis.

If you don’t have complex analysis requirements and if you have plenty of time, you can process terabytes of data without any big data issues. Remember that storing a huge amount of data is not the big problem. But retrieving the data and doing analysis on this data is much more challenging.

But what are complex analytical computations? In SQL we can do a lot of computations by aggregating detailed values (sum, average, min, max etc.). And for many of the typical business performance indicators, this works quite well. But what about the following tasks:

  • Frequency analysis and decompositions (Fourier-/Cosine-/Wavelet transformation) for example for forecasting or decomposition of time series
  • Machine learning and data mining, for example k-means clustering, decision trees, classification, feature selection
  • Multivariate analysis, correlation
  • Projections, prediction, future prospects
  • Statistical tests (for example chi-squared or binomial)
  • Trend calculations, predictions and probability for certain trends or scenarios
  • Complex models involving simulations (for example Monte Carlo simulation for risk analysis)
  • binomial, normal or other types of distributions and density functions

 

For example, a decomposition of a time series into its main components may look like this:

image

(Source: R, temperature in Nottingham taken from the datasets library)

Decomposing time series can helpful to analyze periodicity and trends of sales data for example. This could be important for calculating the effect of promotions or to understand seasonal effects.

And this is just one example. As long as you can only slice and dice on your existing data, you’re always looking at the past. But in order to derive ideas and guidance for future decisions, higher sophisticated methods are required than just sum/group by. Some people even say, that this is where Business Intelligence starts. Everything else is just an analysis of the past which is also important, but there is so much more to find. The current discussion about data scientists clearly shows the rising demand for getting more out of your data. And to be honest, having a data scientist working just with a tool like Excel is like having Dr. House using just a folding rule as medical instrument instead of all the sophisticated laboratory instruments and equipment…it doesn’t work.

So, there are a lot of calculations that go far beyond the capabilities of traditional SQL. Therefore, we usually need to load the data from our data warehouse into some kind of analytical or statistical tool which is specialized such calculations. The results can then be transferred back into the relational database tables. As the focus of such tools differs from the focus of databases, these tools are usually separated from the database but offer interfaces (for example ODBC or flat file) to load data. Common tools are R, SAS, MatLab, just to name a few of them. R (http://cran.r-project.org), for example, is a toolset for doing advanced calculations and research level statistical computing. R is open source and can easily be extended using packages (libraries). Today, a huge amount of such packages exists for all kinds of different tasks.

However, when it comes to Big Data, the process of unloading all the required data can be very time consuming. So for Big Data analytics it’s important to bring both worlds together. This would be the perfect match. For doing so, the following two options are most promising:

  1. Using Hadoop (Map/Reduce)
  2. Using In-Database Analytics

 

Hadoop

PDW v2 offers a seamless integration with Hadoop using Polybase. This makes it easy and fast to export data on a Hadoop infrastructure. Research level analytics can then be performed on the Hadoop cluster. For this purpose, R supports distributed analysis and Map/Reduce jobs using the HadoopStreaming library. But we’re still copying the data out to the analytical environment, right? Yes, but in this scenario, each infrastructure is used in an optimal way:

    • PDW for high-performance SQL queries to combine and arrange the data in the format needed for analytical engines (more like a NoSQL format, for example to prepare variables for data mining).
    • Hadoop for distributed parallel computing tasks using Map/Reduce jobs
    • High performance (massive parallel) data transfer between the MPP (PDW) and Hadoop.
    • Transparent access of the analytical results using SQL (seamless integration of relational and non-relational data with Polybase)

Preparing the data for analytics can be a complex and challenging process. Usually data from multiple tables needs to be joined and filtered. Using SQL is the best choice for this task. For example, for preparing call center data for a mining model, it may be necessary to create variables (single row of data) that contain the number of complaints per week over the last weeks. This can then be used to build a decision tree. In SQL, this task is easy and in an MPP environment, we get the best performance for this task. For the decision tree we need to perform a feature selection at each node of the tree. This involves statistical functions and correlation which reach far beyond SQL. Using the analytical environment is the best choice for such advanced calculations. The resulting decision tree (rules, lift chart, support probabilities etc.) can then be stored as a file on the Hadoop cluster and from there being queried or imported back into relational database tables using Polybase.

 

In-Database Analytics

Another approach is to operate the analytical engine on the same platform and on the same data as the MPP database system. This approach ties both worlds together in a very consistent way but it’s currently not available on the PDW (although it is on my personal wishing list). However, in other MPP environments, this approach is not uncommon. For example, in SAP HANA you can write stored procedures in R just like this

CREATE PROCEDURE myCalc (…)

LANGUAGE RLANG AS

BEGIN

END;

The function body is then standard R code using R syntax, not SQL.

Typical features for In-Database Analytics include:

  • Analytical stored procedures
  • In-database analytics: direct access to database tables and views from the analytical engine without needing to load/unload the data
  • Tables/Views as parameters for the analytical functions (for example R data frames)
  • Full utilization of in-memory capabilities
  • Full utilization of the parallel query engine

 

Conclusion: In order to perform sophisticated analysis based on your BI data, SQL is not sufficient. Specialized toolsets like R are the the best solution. However, when it comes to Big Data, loading/unloading the data into these toolset may not be efficient anymore. A closer integration is necessary. Using Hadoop or In-Database Analytics are promising approaches for this scenario.

  • Share/Bookmark

What’s the buzz about MPP Data Warehouses (part 2)?

Hilmar Buchta

PDW v1/v2

In my first post I wrote about the need of a consequently tuned and aligned database server system in order to handle a high data warehouse workload in an efficient way. A commonly chosen implementation for this is a massive parallel shared nothing architecture. In this architecture your data is distributed among several nodes, each with their own storage. A central node processes incoming queries, calculates the parallel query plan and sends the resulting queries to the compute nodes.In a simplified form, this architecture looks as shown below:

image

Since different vendors choose different detail strategies, from now on, I’m focusing on the Microsoft Parallel Data Warehouse, or in short, the Microsoft PDW. The PDW is Microsoft’s solution for MPP data warehouse systems. The PDW ships as an appliance, i.e. as a pre-configured system (hard- and software) of perfectly compatible and aligned components, currently available from HP and DELL.

What happens if data is loaded into such a system? Let’s assume we have a table with 6 rows of sales and for simplicity, let’s assume we only have two compute nodes. In order to distribute the data among the compute nodes, a distribution keys needs to be chosen. This key will be used to determine the compute node for each row. Why don’t we just do a round robin distribution? I’m getting back to this point later in this post. The distribution key (table column) is used in a hash function to find a proper node. The hash function takes into account the data type of the distribution key, as well as the number of distributions. Actually, in PDW the table is also distributed on the compute node itself (8 different tables on different files/file groups) to get the optimal usage of the compute node’s cores and the optimal throughput to the underlying storage. For our example, let’s assume that the date values hash to the nodes as shown in this illustration:

image

As you see, each row of data is routed to a specific compute node (no redundancy). Doesn’t make this the compute node a single point of failure in the system?  Actually no, because of the physical layout of the PDW. In PDW v2 two compute nodes share one JBOD storage system, one of them communicating actively with the JBOD, the other using the infiniband network connection. The compute nodes itself are “normal” SQL Server 2012 machines running on Hyper-V. If a compute node fails, the data is still reachable using the second compute node that is attached to this JBOD. The compute nodes form an active/passive cluster, therefore the spare node can take over, if a node fails. The damaged node may easily be repaired or replaced. A Hyper-V image for a compute node sits on the management node (which I omitted in the illustration above). And again, this is just a very broad overview of the architecture. You can find very detailed information on the technology here: http://www.microsoft.com/en-us/sqlserver/solutions-technologies/data-warehousing/pdw.aspx

With the example above, what happens if we query a single date? Since we distributed the table on the date, a single compute node contains the data for this query. The control node can pass the query to the nodes and has no more action to take. The compute node containing the data can directly stream the data to the client. The same would happen if we run a query that groups by Date (and potentially filters by some other columns). Now both compute nodes can separately compute the result and stream the result to the client. What you see from this example is

  • In this case, two machines work in parallel and fully independently from each other
  • Since we distributed on Date and the query uses Date in the grouping, no post processing is necessary (we call this an aggregation compatible query)
    (if the distribution would have happened based on a round robin approach, there would never by an aggregation compatible query)
  • In order to get the best performance in this case, it’s important that the data is equally distributed between the compute nodes. In the worst case of all the data being queried sitting on only one of the compute nodes, this one node would have to do the full work. Choosing a proper distribution key can be challenging. I’m getting back to this in a subsequent post.

What happens if we run a query like the following?

select Sum([Sales Amount]) from Sales

Again each compute node can compute the individual partial result but now these results need to be send to the control node to calculate the final result (so called partition move). However, in this example, the control node gets much less rows to process compared to the total amount of rows. Imagine millions or billions of rows being distributed to the compute nodes. The control node in this example would only get two rows with partial results (as we have two compute nodes in our example). So, this operation still fully benefits from the parallel architecture.

image

And this works for most kind of aggregations. For example, if you replace sum([Sales Amount]) with avg([Sales Amount]), the query optimizer would ask the compute nodes for the sum and count and compute the average in the final step.

Ok, usually data models are more complicated (even in a data warehouse) than a single table. In a data warehouse we usually find a star (or snowflake) relationship between facts and dimensions. For the sales table above, this could look like this:

image

What happens now, if the query above is filtered by the product group, which is an attribute of the product dimension?

select sum([S.Sales Amount]) from Sales S
inner join Product P on S.ProductKey=P.ProductKey
where P.ProductGroup=’X’

How should we distribute the product table on the compute nodes in order to get good query performance? One option would be to distribute on the same key as the Sales. If we can do so, each compute node would see all products that are related to the sales that sit on this compute node and therefore answer the query without needing any data from other nodes (we would call this a distribution compatible query). However, the date is not a column in the product table (this wouldn’t make sense) so we cannot distribute the products in this way. It would work, if we had a SalesOrderHeader and SalesOrderDetail table, both joined and distributed on a SalesOrderID. But for the product dimension (as for most other dimensions too) we can go for a more straightforward approach. Fortunately, in a data warehouse, dimensions usually contain very few rows compared to the fact tables. It’s not uncommon to see over 98% of the data in fact tables. So for the PDW, it would make no difference if we put this table on all compute nodes. We call this a replicated table, while the Sales table itself is a distributed table. By making the Sales table distributed and the dimensions replicated, each compute node can answer queries that filter or group on dimension table attributes (columns) autonomically without needing data from other compute nodes.

 

image

Of course, this is just a very brief overview to show the basic concept. If we need to scale the machine, we could add more compute nodes and (after a redistribution of the data) can easily benefit from the higher computing power. This means we can start with a small machine and add more nodes as required which gives a great scalability. For the PDW v2 you can scale from about 50TB to about 6PB with a linear performance gain.

Also, PDW v2 offers a lot more features. Especially I’d like to mention the clustered column store index (CCI), which is a highly compressed, updatable in-memory storage of tabular data. Together with the parallel processing of the compute nodes this gives an awesome  performance when querying data from the PDW. Also the seamless integration with Hadoop (via PolyBase) allows us to store unstructured data in a Hadoop file system and query both sources transparently from the PDW in the well known SQL syntax without IT needing to transfer the data into the relational database or to write map-reduce jobs.

Again, there is much more to read about the PDW. A good starting point is the PDW website: http://www.microsoft.com/en-us/sqlserver/solutions-technologies/data-warehousing/pdw.aspx

 

Conclusion

Data Warehouses with large amount of data have challenges that go far beyond just storing the data. Being able to query and analyze the data with a good performance requires special considerations about the system architecture. When dealing with billions of rows, classical SMP machines can easily reach their limit. The MPP approach, that distributes data on multiple, independent compute nodes can provide a robust and scalable solution here. The Microsoft Parallel Data Warehouse is a good example for this approach that also includes features like in-memory processing (clustered column store index in v2) and a transparent layer on both structured and unstructured data (Hadoop).

  • Share/Bookmark

What’s the buzz about MPP Data Warehouses (part 1)?

Hilmar Buchta

PDW v1/v2

In the context of more and more data and the need of being able to analyze this data, you might also have stumbled over the MPP approaches for large data warehouses. MPP stand for massively parallel processing in contrast to SMP which means symmetric multi processing. A good definition both worlds can be found here: In an SMP machine, you usually have multiple CPUs which are sharing memory (RAM, disks) and are therefore well suited to boost performance on CPU critical tasks while in an MPP machine you also have multiple CPUs, but this time, each CPU has it’s own memory. Therefore MPP systems are better suited for a workload where you need a very high throughput of data. And this is what we typically see in data warehouses. Here we need to load large amount of data and we need to efficiently query large portions of this data.

But wait, comparing DWH and OLTP, I’m thinking of the following situation:

image

The main difference between an OLTP and a DWH solution is the data model and not the underlying hardware or database server software. Or, in other words, a modern database server should be suitable for both work loads, OLTP and DWH. The data model (database layout) however, differs a lot: In case of an OLTP database, we want to reduce redundancy and therefore build the data model as by normalizing the data to transactional and reference data and potentially complex relationship between tables. On the other hand, in a DWH model we want to be able to read large amounts of data with simple queries and therefore prefer a de-normalized model (star schema).

And as long as we do not have too much data, this point of view works fine. It’s surprising that it does, as both use cases have different requirements to the underlying infrastructure.

OLTP-System

  • Usually small batches of data (transactions), usually structured in a complicated way (covering many tables)
  • Needs to be able to roll back changes spanning multiple tables (complex transactions)
  • Ensure data integrity (foreign keys, other contraints)
  • Ensure simultaneous read/write access of users to the same amount of data (isolation level)
  • Support rich programming features (triggers, user defined functions etc.)

DWH-System

  • Load large amounts of data at specific loading times
  • Query large amounts of data (often in “full scan”), create aggregates

But still, for small amounts of data, you don’t have to consider the infrastructure too much. However, as the amount of data and its complexity grows, you have to think about ways of optimizing your data warehouse architecture:

image

The first step is to apply best practices for your data warehouse model. For example, loading large amount of data is not a good idea if you are having active foreign key constraints or – even worse – triggers – on your tables. But having a feature in a database software, does not necessarily mean, that you have to use it. So, here are some of these best practices:

  • Avoid active foreign key constraints when loading a large amount of data
  • use table partitioning and partition switching for updates rather than individual row insert/update processes (for example: late arriving facts)
  • Avoid granular transaction logging (simple recovery model)

On this step, you didn’t really touch the data warehouse system infrastructure at all. So your database server is still “universal”. On the next step of complexity, usually we start tuning the machine itself, for example

  • Choose a specific layout for your IO (SAN, RAID)
  • Choose a specific distribution of database files and file groups (log, temp etc.)
  • Use specifically tuned machines for the different tasks, like staging, ODS, data warehouse, data marts
  • Use server clusters to balance workload and provide high availability

At this step, the SQL Server becomes more and more optimized for data warehouse workload. It will be possible to run OLTP workload too, but this maybe less efficiently, as we started to optimize for DWH workload.

However, as the amount of data grows, one question comes in to mind: Wouldn’t it be better, to really optimize the database server for DWH workload? And consequently, don’t consider OLTP requirements as we do this optimization? This will offer different ways of storing and handling the data. If we follow this path, we get an infrastructure that might not be suited for OLTP traffic at all, but perfectly supports large loads and fast reads of very large data.

image

MPP data warehouse solutions, like Teradata, Oracle Exadata, IBM Netezza and Microsoft Parallel Data Warehouse or Greenplum are examples of these approaches. Usually, the approach is a shared nothing MPP architecture of nodes, which have their own segment of data on their own disks (not a shared memory or disk). Most consequently all components (including the hardware) are perfectly tuned and aligned for this purpose. To achieve this, pre-installed and configured appliances are commonly used, so instead of buying hardware and software individually and trying to make it run well and fast, you get a “black box” (i.e. one or more racks) of components and software that are selected and configured in the best possible way.

In part 2 of this post, I’ll show the basic ideas of this shared nothing architecture and how query performance can benefit from the distribution of data on several compute nodes.

  • Share/Bookmark

When data gets big

Hilmar Buchta

BI

So this post is about big data. When looking around on the internet, you can find amazing examples of big data. For example, the Large Hadron Collider of the CERN generates about 15 Petabytes of data every year (see http://www.lhc-facts.ch/index.php?page=datenverarbeitung for details). This is about 41 Terabyte each day. Impressive. However, you might argue that you don’t have such a collider in your company. In fact, most companies will only have to deal with a very small fraction of this amount of data. So where does big data start for common business applications? And what does it mean for the IT strategy. Does it have an influence or is it just a matter of scaling and improving systems – a process that we always have to do in IT to keep up with business requirements.

Wikipedia defines big data as “a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications” (http://en.wikipedia.org/wiki/Big_data). And I think that this definition is a good starting point because it focuses not only on the amount of data but puts that amount in relation to what we do with the data (process). Let me paraphrase this definition:

I’m talking of big data if my analytical capabilities are no longer sufficient because of the amount or complexity of the data and if I’m not able to scale this capabilities using traditional approaches (more RAM, more servers in my clusters, more disks etc.). It’s not difficult (and not expensive) to store Petabytes of data. It’s difficult to process that data and to do analytics on this data and to gain insights.

So, to be honest, even a few hundred million rows of data may be big, if I’m not able to perform the important analytics, that I need to supply my core business processes in a timely manner. And there are two things to keep in mind:

  • Modern ideas of modeling markets and complex statistical models and methods are available.
  • While it may be difficult to apply these methods, maybe our competitors already do.

Also, another aspect about analytical capabilities is the question “Do I have the right data or do I need other data sources?”. Limits in analytical capabilities may also exists because I don’t have the information I would need. In todays world with lots of data markets (like Microsoft Azure Datamarket, http://datamarket.azure.com/), it’s reality that you can get information/data that you might not even have dared dreaming about a few years ago. Now you get data about consumer trends, your competitors or global trends and you get this data in a reliable, accurate and up-to-date way. Again, this increases the amount of data you need to process and by that, may worsen the analytical restrictions that result from the pure amount of data.

But then, this still is nothing new. As I mentioned before IT had to follow these requirements during each of the last years. We added more cores, bought newer machines. The database algorithms improved, we used OLAP and other technologies to speed up analysis. But let me get back to the second half of the definition from above: “it becomes difficult to process using on-hand database management tools or traditional data processing applications”. If you like, you may replace “difficult” with “makes it more expensive” or “costs more afford” or – in some cases – “makes it impossible, at least for the required period of time”. For example, if you want to calculate a complex price elasticity model in retail and it takes you a month to do so, the result will not be useful anymore as the situation in your market might have already changed significantly (for example because of your competitors’ campaigns).

Again, this is not really new. During the past you may have added other components in your IT strategy, for example OLAP. Or you have replaced a slow database solution with a faster one. And you focused on scalability in order to cope with these challenges. So, you might look at some typical components of a big data environment in just the same way. Here are some of them (be careful, buzzword mode is switched on now):

  • Hadoop, Hive, Pig etc.
  • In-Memory Computing
  • Massive parallel computing (MPP).
  • Cloud
  • Complex event processing (CEP)
  • Etc.

[buzzword mode off] However, if you think of these components in a traditional way of enhancing the IT infrastructure you might think in the wrong direction. The main thing about big data is, that when you get to the limits of your analytical capabilities (as in the definition from above) there are almost always tools and methods to get beyond those limits. However, these tools may require some fundamental changes in the IT ecosystem. As for MPP databases, for example, it’s not done getting one and putting all the data on it, but it is about re-shaping the BI architecture in order to match the new paradigm of those systems.

image

During the next posts, I’ll get a little bit deeper in this topic, the fundamental changes in the Big Data Architecture and especially MPP databases.

  • Share/Bookmark

Dashboards, Scorecards, KPIs… are they really useful?

Hilmar Buchta

Business Intelligence

When talking about dashboards, scorecards and KPIs sometimes a discussion emerges about the usefulness of all this. One common argument against these management tools is, that they are not flexible enough and that they cannot really cover the most important aspects of business. An important nature of a KPI is, that you can measure it. However, the important aspects of management seem to go far beyond measurable indicators. For example, the invention of a new kind of product or service or extending your business to another country or market. These are strategic decisions, ideas, inventions. How can we put them into KPIs? KPIs seem to be exactly the opposite to these strategic ideas. In fact, a typical thesis about KPIs is

“If your KPIs are green you can go golfing”

Almost every manager I talked to disagrees on that. There is no such thing like a KPI that indicates that there is no need for action. As we are in a competitive market, there is always the need to think about future improvements, new market strategies and plans. And even if somebody is determined to work with KPIs, the next discussion evolves about questions like “do they need to be real-time?”, “do I need them on my mobile device?”.

But let’s start at the beginning. We have to concentrate on the aspects of management before we can decide how these aspects can be supported by measures or KPIs. Looking at your task list, it should be possible to arrange all the tasks on the following board:

image

While there is common agreement that management should delegate topics in squares 1 and 3 (not important), the role of the other two squares is essential. To make it short, the square which deserves the highest focus by management is square number 4, tasks which are not (yet) urgent but very important. It is this square where you find the invention of new ideas, new products, the development of your staff etc. It is so important because this will ensure your position in the future. If management doesn’t work on these topics, nobody does. However, in daily business it’s square number 2 that often disturbs this creative process.

With a little bit of simplification we can concentrate the work of a manager on those two squares:

Square 2: Operational aspects

Square 4: Strategic development of the operational unit

Looking at those two squares you find tasks that are very different:

Operational Strategic
Repeating, well defined New, inventive
Urgent, high impact on operational processes Not urgent, but will become crucial in the future
Clearly defined measures Hard, if not impossible to measure

Of course, this is a very broad simplification, just to make the point. Strategic decisions lead to tactical tasks which can then be tracked and measured and/or detailed to operational decisions.

KPIs can easily be set up for operational or tactical topics. And for these two aspects it’s also important that they are near-time, always available and support some kind of push alerts. Think of a manager being responsible for a certain step in production. Strategically the manager will think about better ways of doing this production step, maybe methods which use less energy or improve production for a higher quality. But if there is a failure in the machine and production stops because of this, we have an escalation which deserves the full attention of the manager.

Therefore, I would like to modify the statement from above to

“If your operational KPIs are green you can take your time to think about the strategically important aspects of your business

In this sense, KPIs are very useful to get a quick overview about the operational topics you are responsible for. Such KPIs could cover production output, number of complaints in a call center, quality measures, automatic testing procedures etc. And now imagine you have these KPIs on your mobile device in real (or near) time. So you are informed about operational problems long before any of your managers notice them. This enables you to take counter measures and to provide answers instead of questions. Having such KPIs gives the power to act instead of reacting. And once the KPIs go back to green again you can concentrate on the strategically important aspects of business.

I’m not saying that KPIs don’t make sense for the strategical topics at all. But they are on a more long-term basis. Having them on a mobile device for example doesn’t give much benefit. If they are not urgent you can also monitor them the next time, you are at your desk. With operational KPIs you might not have this time.

So, having access to operational KPIs and alerting is important for management. How can we put these KPIs on scorecards and dashboards? One common mistake is the believe that these KPIs are “static”. Sometimes dashboards are created using reporting tools. People decide which KPIs are important and those are developed into a “dashboard”. Things are even worse, if developers are needed in order to change KPIs. For example, think of a product manager being responsible for a new product.

Situation KPIs needed
The product needs to be developed in time for the exhibition in fall. Only six months to go… Development progress of the product
Development progress of marketing material
Six months later, at the exhibition, there is a lot of interest in the new product. Many people are placing orders Monitor the delivery time of the product
Monitor the production rate of the product
Two months later, there is a unusual high rate of complaints. Obviously, there are quality issues. After investigation, the source of those issues is found. Monitor the complaint rate
Monitor the guarantee process
plus:
Monitor the delivery time of the product
Monitor the production rate of the product
Again four months later: The quality issues are solved, but sales are decreasing. The reason is that a competitor has developed a similar product at a lower price. Your answer is increase marketing and to reduce the price while working on new features in the product to come up with a more competitive product soon. Monitor the marginal return of the product
Monitor sales rate
Monitor development of new features
Monitor competition

I think it’s clear, that a useful dashboard/scorecard has to adapt to the needs of business in a timely manner. While the dashboard of a car is simple and mostly static, a business dashboard is not. If such a dashboard is developed using reporting tools it’s likely that your users will only use it for a few weeks or months. After that, it doesn’t reflect the business needs any more and needs to be adjusted. Even for car dashboards we find more flexibility nowadays. If you’re running out of fuel, the car’s gps gives you advice on nearby gas stations. When approaching your destination, the car informs you about parking facilities. So even the static dashboard of a car becomes more and more dynamic and case sensitive. Why should a less flexible dashboard work for business?

 

Conclusion

KPIs, Scorecards and Dashboards based on operational measures are adding a true benefit to the work of a manager. If a KPI turns red, immediate action is necessary. Even if the situation is already under control, the manager needs to be informed and potentially needs to inform others. Therefore, these KPIs have to be real time / near time and have to be available everywhere (mobile devices). And having these KPIs in status “green” does not mean, the manager has nothing to do, but that the manager has time to concentrate on the most important aspects of work: The strategic development of the operational unit and it’s staff members. But it’s important that these KPIs are flexible – as flexible as business is. Therefore reporting tools are not a good idea of building dashboards unless the set of KPIs can easily be adjusted using self-service tools. A more promising approach is the use of Visual-BI tools, that allow end users to easily adopt dashboards and visualizations without the delays resulting from round trips to IT and development.

  • Share/Bookmark

DAX vs. MDX: An example

Hilmar Buchta

SQL Server 2012

I’m often asked about the key differences between DAX and MDX or in more general, the difference between the tabular and the multidimensional model. From my presentation on the PASS Camp 2012 in Germany I’d like to share some of the ideas here.

From the perspective of an expression or query language, one of the most important differences is the concept behind both approaches.

For a cube, we have the concept of a tuple for addressing a cell in the cube space. The axis in the tuple are setting the coordinates. If we have a single tuple, the result is the content of the corresponding cell in the cube. As cube attributes have an All-member which serves as the default (in most cases) if the attribute is omitted from the tuple, we also get aggregated this way. For example, a tuple like

(Date.Calendar.[Calendar Year].&[2013], Measures.[Internet Sales Amount])

returns the (aggregated) sales amount for the year 2013. Other attributes (for example Product) are on their default members. As you see, there is no need to supply an aggregation function in this case (although MDX has such functions when aggregating values over sets) as the cube contains the definitions about how to aggregate the sales amount. The following sketch illustrates this way, of addressing a value in a cube:

image

For the tabular model, filters within our pivot table work like filters on the underlying tables. Even with a single selected value, multiple rows can be included in the filter of a table. For example, if you filter the Year the 2013, the underlying date table will be filter to all 365 days of that year. For a given measure all filtered tables are intersected giving the detail values that correspond to the filter. As there are potentially many rows of data after this process, an aggregation function is required to compute the aggregate. This is shown in the following illustration:

image

If you are more familiar with SQL than with MDX, the concept of filtering and aggregating in DAX will be more familiar to you. In SQL, as in DAX, we’re usually restricting table rows (using the WHERE clause in SQL or the FILTER function in DAX). Then we create groups (using the GROUP BY clause in SQL or the SUMMARIZE function in DAX) and finally we’re computing the aggregates using an appropriate aggregation function (like SUM).

However, many questions that require this process in SQL or DAX can be solved in MDX just by addressing cells. Let my provide you with an example that I frequently use during my MDX trainings: The requirement is to create a computed measure that gives the sales amount of sales on weekend. If you have a strong SQL background your solution in MDX may look like this:

Aggregate(
    filter(
        descendants(
            [Date].[Calendar].currentmember,
            [Date].[Calendar].[Date]
        )
        ,
        [Date].[Day of Week].currentmember IS [Date].[Day of Week].[Sunday]
        or [Date].[Day of Week].currentmember IS [Date].[Day of Week].[Saturday]
    )
    ,[Measures].[Internet Sales Amount]
)

This seems to be a natural approach. Using the descendant function we create a set of all the dates on the selected date member (for example a month, a quarter, a year). We then filter those dates using the filter function, so that only Sunday and Saturday remains in the set. Finally we aggregate the measure sales amount along this set.

Actually this way of approaching this calculation is very similar to SQL or DAX. For example in DAX we would write the same calculation in almost exactly the same way:

evaluate(
    summarize(       
        filter(
            ‘Internet Sales’
            , related(‘Date’[Calendar Year])=2007
        )                   
        , ‘Date’[Month]
        , "Umsatz"
        , Sum(‘Internet Sales’[Sales Amount])
        , "UmsatzWE"
        , Calculate(
            Sum(‘Internet Sales’[Sales Amount])
            , Or(
                ‘Date’[Day Name Of Week]="Sunday"
                , ‘Date’[Day Name Of Week]="Saturday"
            )
         )
    )
)

Although this looks very similar to the MDX code from above, the MDX code above is close the most complicated solution available. Since the weekday is a cube attribute, we can simply address sales on weekends but using a tuple (ok, a sum of two tuples):

([Measures].[Internet Sales Amount], [Date].[Day Name].[Sunday])
+
([Measures].[Internet Sales Amount], [Date].[Day Name].[Saturday])

So when writing a DAX query, we rather think

  • How do I filter my underlying tables
  • What aggregation function do I need

In MDX on the other hand, we rather think

  • What axis do I have to address in order to pull my value out of the cube
  • Share/Bookmark

  • Kategorien

  • Copyright © ORAYLIS Blog. All rights reserved.