Computer Science 380:

Principles of Database Systems

Chapter 20

Gregory M. Kapfhammer


Creative Commons licensed (BY-NC-SA) Flickr photo shared by danmachold

Data Warehousing and Mining

Big Data

In 2013, the world's stored information totaled about 1,200 exabytes!

Less than 2 percent of it is in non-digital format!

Information grows four times faster than the world economy!

At its core, big data is all about making predictions!

Five Trends in Big Data

More

Messy

Good Enough

Correlation

Datafication

What are the risks?

Decision-Support Systems

Broad classification into two categories ...

... transaction-processing or decision-support

Goal: get actionable information out of the details!

Decision-Support Challenges

Not all queries can be expressed in SQL

Query languages are not best for statistical analyses

Diverse data sources are not easily loaded into an RDBMS

Must combine expertise from a wide variety of areas

Components of a Data Warehouse

Data Gathering

How? Push or pull methods

What schema?

How to transform and cleanse?

How to handle data updates?

What data to summarize?

ETL: Extract, transform, load
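As a small sketch of the transform-and-cleanse steps, here is a base R example; the table and column names are hypothetical, invented only for illustration:

```r
# Hypothetical ETL sketch in base R: extract raw text data,
# transform the types, cleanse missing values, and load a summary.
raw <- data.frame(
  customer = c("Ann", "Bob", "Ann", "Cy"),
  amount   = c("12.50", "7.00", "NA", "3.25"),  # extracted as text
  stringsAsFactors = FALSE
)

# Transform: convert the amount column to numeric ("NA" becomes NA).
raw$amount <- suppressWarnings(as.numeric(raw$amount))

# Cleanse: drop rows with missing amounts.
clean <- raw[!is.na(raw$amount), ]

# Load: aggregate into a per-customer summary table.
summaryTable <- aggregate(amount ~ customer, data = clean, FUN = sum)
print(summaryTable)
```

Real warehouses do this at scale with dedicated ETL tools, but the three stages are the same.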

Warehouse Schemas

Fact table: a central, large table

Dimension table: smaller tables for specific data

We can create star and snowflake schemas!

See Figure 20.2 for an example!
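A star schema can be mimicked with data frames in R; the tables below are toy stand-ins for the fact and dimension tables in the textbook's figure:

```r
# Hypothetical star schema: one central fact table joined to
# two small dimension tables through foreign keys.
sales <- data.frame(itemId  = c(1, 2, 1),
                    storeId = c(10, 10, 20),
                    units   = c(5, 3, 2))                        # fact table
item  <- data.frame(itemId  = c(1, 2),
                    name    = c("pen", "ink"))                   # dimension
store <- data.frame(storeId = c(10, 20),
                    city    = c("Erie", "Meadville"))            # dimension

# A star join resolves each foreign key against its dimension table.
joined <- merge(merge(sales, item, by = "itemId"), store, by = "storeId")
print(joined[, c("name", "city", "units")])
```

A snowflake schema would simply normalize the dimension tables further, adding more joins.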

Column-Oriented Storage

When is this approach the best?

Scanning and aggregating multiple tuples!

But, not widely used for transaction processing. Why?

Storing and fetching a tuple requires many operations!
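R data frames happen to be column-oriented themselves (each column is one contiguous vector), so they can illustrate the trade-off directly:

```r
# Scan-and-aggregate workloads read one column vector end to end,
# which is exactly the access pattern column stores are built for.
n <- 1e5
orders <- data.frame(id    = seq_len(n),
                     price = runif(n),
                     qty   = sample(1:5, n, replace = TRUE))

# Aggregating a single attribute touches only that column.
total <- sum(orders$price)

# Fetching one whole tuple must visit every column vector,
# which is why column stores suit analytics but not OLTP.
tuple <- orders[42, ]
```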

Data Mining

Analyzing large databases to find useful patterns

Descriptive patterns

Associations

Predictions

What kinds of algorithms can help?

Machine Learning

Supervised

Unsupervised

Reinforcement

Training and Testing

k-fold cross validation

Avoiding an "overfit"
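A minimal k-fold cross-validation loop in base R, using a simple linear model on synthetic data just to show the fold mechanics:

```r
# k-fold cross validation: train on k-1 folds, test on the held-out
# fold, and average the test error across all k folds.
set.seed(1)
k <- 5
d <- data.frame(x = rnorm(100))
d$y <- 2 * d$x + rnorm(100)
fold <- sample(rep(1:k, length.out = nrow(d)))   # random fold labels

errors <- sapply(1:k, function(i) {
  model <- lm(y ~ x, data = d[fold != i, ])          # train
  pred  <- predict(model, newdata = d[fold == i, ])  # test
  mean((d$y[fold == i] - pred)^2)                    # mean squared error
})
mean(errors)  # cross-validated estimate of test error
```

Because every observation is tested exactly once on a model that never saw it, the average error is an honest check against overfitting.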

Classification

Study the examples on page 895

Decision-tree classifiers

Classification and regression trees

What is the difference?

Review the example on page 896

Greedy and global searches for the best split

Other Types of Classifiers?

Trade-offs in the Choice of Classifier?

Evaluation of Classifiers

False positive

False negative

Accuracy and Recall

Precision and Specificity

Refer to page 903 for equations!
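The four metrics fall straight out of the 2x2 confusion matrix; the counts below are made up purely for illustration:

```r
# Confusion-matrix counts (hypothetical values).
tp <- 40; fp <- 10; fn <- 5; tn <- 45

accuracy    <- (tp + tn) / (tp + fp + fn + tn)  # fraction classified correctly
precision   <- tp / (tp + fp)                   # of those flagged, how many are real
recall      <- tp / (tp + fn)                   # of the real ones, how many are caught
specificity <- tn / (tn + fp)                   # true-negative rate
```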

Association Rule Mining

Please see page 904 for an example!

How are these rules useful?

Can consider both rules and their deviations

Also consider deviations from temporal patterns!
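Support and confidence, the two numbers behind every association rule, can be computed by hand; the baskets and the rule {bread} => {milk} here are a toy example:

```r
# Toy market baskets for the hypothetical rule {bread} => {milk}.
baskets <- list(c("bread", "milk"),
                c("bread", "eggs"),
                c("milk"),
                c("bread", "milk", "eggs"))

hasBread <- sapply(baskets, function(b) "bread" %in% b)
hasBoth  <- sapply(baskets, function(b) all(c("bread", "milk") %in% b))

support    <- mean(hasBoth)                 # fraction of baskets with both
confidence <- sum(hasBoth) / sum(hasBread)  # P(milk | bread)
```

Mining algorithms such as Apriori just search for all rules whose support and confidence clear chosen thresholds.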

Cluster Analysis

Hierarchical

Agglomerative

k-means clustering

How do you pick the distance function?

Why is this approach useful?
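The built-in stats::kmeans function makes the idea concrete; it answers the distance-function question by using squared Euclidean distance:

```r
# k-means clustering on two well-separated synthetic groups.
set.seed(1)
pts <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),   # 20 points near (0, 0)
             matrix(rnorm(40, mean = 5), ncol = 2))   # 20 points near (5, 5)

km <- kmeans(pts, centers = 2)

# Each point receives a cluster label; the centers approximate
# the two group means.
table(km$cluster)
```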

Text Mining

Natural Language Processing

Data Visualization

Let's Try It!

Installation

Enter the R environment by typing R

We have to install several packages!

Be ready to wait for a while!

install.packages("Formula")

install.packages("evtree")

install.packages("partykit")

Pick a CRAN mirror in PA

Did your installation work correctly?

Data Viewing

Let's load the data set!

data("BBBClub", package="evtree")

BBBClub

summary(BBBClub)

What are the explanatory variables?

What attribute are we trying to predict?

Can you find any patterns in the data set?

Predictions

Goal: build a predictive model of customer choice

Can we know when a customer will purchase the book?

Do you have any intuitive predictions?

How did you form those predictions?

Are those predictions actually correct?

Let's use machine learning to predict!

Library Loading

library("rpart")

library("partykit")

library("evtree")

Running each command produces no output!

Ready to try the next step?

Summarizing

summary(BBBClub)

How many attributes are there?

What are the attributes?

What trends do you find in the data set?

Does it support your intuitive prediction?

Can you make any additional predictions?

Rpart Training

rp <- as.party(rpart(choice~., data=BBBClub, minbucket=10))

rpTwo <- as.party(rpart(choice~., data=BBBClub, minbucket=10, maxdepth=2))

Let's review the components of this command

Okay, now we need to train with two other methods!

Ctree Training

ct <- ctree(choice~., data=BBBClub, minbucket=10, mincrit=0.99)

ctTwo <- ctree(choice~., data=BBBClub, minbucket=10, mincrit=0.99, maxdepth=2)

You can learn more about these methods by typing ?rpart and ?ctree

Any questions so far?

Evtree Training

ev <- evtree(choice~., data=BBBClub, minbucket=10, maxdepth=2)

Can you compare the execution times?

Differences between global and local search methods

Consider running this command to make multiple trees!

Are the trees any different?

Visualization

X11(height=6, width=8)

plot(ev)

Create a visualization for each of the trees

Make sure that you have a separate X11 window!

Compare and Contrast

Understanding the Trees

Making Predictions
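One way to wrap up: a sketch, assuming the rp, ct, and ev trees fitted on the earlier slides are still in your session along with BBBClub, that compares each model's accuracy on the training data with predict():

```r
# Sketch only: assumes BBBClub is loaded and rp, ct, ev were
# fitted as shown on the training slides.
accuracy <- function(model) {
  mean(predict(model, newdata = BBBClub) == BBBClub$choice)
}

sapply(list(rpart = rp, ctree = ct, evtree = ev), accuracy)
```

Remember that training-set accuracy flatters every model; held-out data or cross validation gives the honest comparison.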