Gregory M. Kapfhammer
In 2013, the world stored an estimated 1,200 exabytes of information!
Less than 2 percent of it is non-digital in format!
Information is growing four times faster than the world economy!
At its core, big data is all about making predictions!
Broad classification into two categories ...
... transaction-processing or decision-support
Goal: get actionable information out of the details!
Not all queries can be expressed in SQL
Query languages are not well suited to statistical analyses
Diverse data sources are not easily loaded into an RDBMS
Must combine expertise from a wide variety of areas
How? Push or pull methods
How to transform and cleanse?
How to handle data updates?
What data to summarize?
ETL: Extract, transform, load
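The ETL steps above can be sketched in a few lines of base R. This is only an illustrative toy (the table and column names are invented), but it shows extract, transform/cleanse, and load as distinct phases:

```r
# Toy ETL sketch in base R (illustrative only; the table and
# column names are invented for this example).

# Extract: raw records, as they might arrive from an operational system
raw <- data.frame(
  name  = c(" Alice ", "bob", "Carol"),
  sales = c("100", "250", "75"),
  stringsAsFactors = FALSE
)

# Transform: cleanse whitespace, normalize case, fix the types
clean <- data.frame(
  name  = tools::toTitleCase(trimws(raw$name)),
  sales = as.numeric(raw$sales),
  stringsAsFactors = FALSE
)

# Load: append the cleansed rows into the warehouse table
warehouse <- data.frame(name = character(0), sales = numeric(0),
                        stringsAsFactors = FALSE)
warehouse <- rbind(warehouse, clean)
```

Real ETL pipelines must also answer the push-versus-pull and update-handling questions above; this sketch covers only one batch.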
Fact table: a central, large table
Dimension table: smaller tables for specific data
We can create star and snowflake schemas!
See Figure 20.2 for an example!
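A fact table and its dimension tables can be mimicked with data frames. This sketch uses hypothetical sales tables (loosely echoing the textbook's sales example) and runs one decision-support query across the star:

```r
# Minimal star-schema sketch in base R (hypothetical tables and
# column names, for illustration only).

# Fact table: one row per sale, with foreign keys into the dimensions
sales_fact <- data.frame(
  item_id  = c(1, 2, 1),
  store_id = c(10, 10, 20),
  price    = c(5.0, 7.5, 5.0)
)

# Dimension tables: smaller tables of descriptive attributes
item_dim  <- data.frame(item_id = c(1, 2), item_name = c("pen", "book"))
store_dim <- data.frame(store_id = c(10, 20), city = c("Meadville", "Erie"))

# A decision-support query: total revenue per city, joining the
# central fact table with one dimension table
joined  <- merge(sales_fact, store_dim, by = "store_id")
revenue <- aggregate(price ~ city, data = joined, FUN = sum)
```

A snowflake schema would further normalize the dimension tables into their own sub-dimensions.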
When is this approach the best?
Scanning and aggregating multiple tuples!
But, not widely used for transaction processing. Why?
Storing and fetching a tuple requires many operations!
Analyzing large databases to find useful patterns
What kinds of algorithms can help?
Training and Testing
k-fold cross validation
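The k-fold idea can be hand-rolled in a few lines of base R: randomly assign each row to one of k folds, then hold out one fold at a time. (Real analyses would normally use a package helper; this is just the mechanics.)

```r
# Sketch of k-fold cross validation indices in base R.
k <- 5
n <- 100
set.seed(1)
folds <- sample(rep(1:k, length.out = n))  # randomly assign each row to a fold

for (i in 1:k) {
  test_rows  <- which(folds == i)   # the held-out fold
  train_rows <- which(folds != i)   # the remaining k - 1 folds
  # fit the model on train_rows, evaluate on test_rows,
  # then average the k evaluation scores
}
```

Because every row is held out exactly once, the averaged score is a less optimistic estimate than training-set accuracy, which helps detect overfitting.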
Avoiding "overfitting"
Study the examples on page 895
Classification and regression trees
What is the difference?
Review the example on page 896
Greedy and global searches for the best split
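A greedy search tries every candidate split on an attribute and keeps the one that most reduces impurity, without looking ahead. This toy sketch (invented data, base R only) scores binary splits on one numeric attribute with weighted Gini impurity:

```r
# Sketch of a greedy search for the best binary split on one numeric
# attribute, scored by weighted Gini impurity (toy data).
gini <- function(y) {
  p <- table(y) / length(y)
  1 - sum(p^2)
}

x <- c(1, 2, 3, 4, 5, 6)                       # attribute values
y <- c("no", "no", "no", "yes", "yes", "yes")  # class labels

best <- list(split = NA, score = Inf)
for (s in sort(unique(x))[-1]) {               # candidate thresholds
  left  <- y[x <  s]
  right <- y[x >= s]
  score <- (length(left) * gini(left) +
            length(right) * gini(right)) / length(y)
  if (score < best$score) best <- list(split = s, score = score)
}
# best$split is 4: x < 4 separates the classes perfectly (impurity 0)
```

A global search, by contrast, evaluates whole trees rather than one split at a time, which is exactly the difference explored with evtree later in this tutorial.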
Accuracy and Recall
Precision and Specificity
Refer to page 903 for equations!
Please see page 904 for an example!
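All four measures fall out of the confusion-matrix counts. The counts below are hypothetical (the textbook's pages 903-904 give the formal equations and a worked example), but the formulas are standard:

```r
# Computing the four metrics from confusion-matrix counts in base R
# (the counts themselves are invented for illustration).
tp <- 40; fn <- 10   # actual positives: found vs. missed
fp <- 5;  tn <- 45   # actual negatives: misflagged vs. correctly rejected

accuracy    <- (tp + tn) / (tp + tn + fp + fn)  # fraction correct overall
precision   <- tp / (tp + fp)   # predicted positives that are right
recall      <- tp / (tp + fn)   # actual positives found (sensitivity)
specificity <- tn / (tn + fp)   # actual negatives correctly rejected
```

Note how the metrics can disagree: a classifier can have high accuracy yet poor recall when the positive class is rare.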
How are these rules useful?
Can consider both rules and their deviations
Also consider deviations from temporal patterns!
How do you pick the distance function?
Why is this approach useful?
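The choice of distance function directly changes which points look "close." This sketch (toy 2-D points, base R only) compares two metrics on the same data and runs a simple k-means clustering:

```r
# How the distance function matters (toy 2-D points).
pts <- matrix(c(0, 0,
                0, 3,
                4, 0), ncol = 2, byrow = TRUE)

# Euclidean vs. Manhattan distance between the same pairs of points
d_euc <- dist(pts, method = "euclidean")
d_man <- dist(pts, method = "manhattan")

# A simple clustering; kmeans() uses Euclidean distance by construction
set.seed(1)
km <- kmeans(pts, centers = 2)
```

The farthest pair is 5 apart under Euclidean distance but 7 apart under Manhattan distance, so the "right" function depends on what a difference in each attribute actually means for the application.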
Enter the R environment by typing R at your terminal prompt
We have to install several packages!
Be ready to wait for a while!
Pick a CRAN mirror in PA
Did your installation work correctly?
Let's load the data set!
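One way to do the installation and loading looks like this (the package names come from the commands later in the tutorial, and the BBBClub data set ships with the evtree package; install.packages() will prompt you to pick a mirror):

```r
# Install the packages used in this tutorial, then load the data.
install.packages(c("rpart", "partykit", "evtree"))  # choose a CRAN mirror when asked

library("evtree")                    # also loads partykit as a dependency
data("BBBClub", package = "evtree")  # Bookbinder's Book Club customer data
head(BBBClub)                        # peek at the first few rows
```

If head() shows the data frame, your installation worked correctly.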
What are the explanatory variables?
What attribute are we trying to predict?
Can you find any patterns in the data set?
Goal: build a predictive model of customer choice
Can we know when a customer will purchase the book?
Do you have any intuitive predictions?
How did you form those predictions?
Are those predictions actually correct?
Let's use machine learning to predict!
Running each command produces no output!
Ready to try the next step?
How many attributes are there?
What are the attributes?
What trends do you find in the data set?
Does it support your intuitive prediction?
Can you make any additional predictions?
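A few base-R commands answer these questions. They are shown here on a tiny invented stand-in data frame (the real BBBClub columns may differ); run the same commands on BBBClub itself:

```r
# Exploration commands, demonstrated on a small invented stand-in.
df <- data.frame(
  choice = factor(c("yes", "no", "yes", "no")),  # attribute to predict
  amount = c(120, 30, 85, 10),
  gender = factor(c("F", "M", "F", "M"))
)

dim(df)           # number of rows and attributes
names(df)         # the attribute names
summary(df)       # per-attribute distributions
table(df$choice)  # class balance of the attribute we predict
```

Checking the class balance first is worthwhile: a badly skewed response makes accuracy alone misleading, as the precision and recall discussion showed.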
library("rpart"); library("partykit")  # rpart() fits the tree; as.party() converts it
rp <- as.party(rpart(choice~., data=BBBClub, minbucket=10))
rpTwo <- as.party(rpart(choice~., data=BBBClub, minbucket=10, maxdepth=2))
Let's review the components of this command
Okay, now we need to train with two other methods!
library("partykit")  # provides ctree()
ct <- ctree(choice~., data=BBBClub, minbucket=10, mincrit=0.99)
ctTwo <- ctree(choice~., data=BBBClub, minbucket=10, mincrit=0.99, maxdepth=2)
You can learn more about these methods by typing, e.g., ?rpart or ?ctree
Any questions so far?
library("evtree")  # provides evtree()
ev <- evtree(choice~., data=BBBClub, minbucket=10, maxdepth=2)
Can you compare the execution times?
Differences between global and local search methods
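system.time() is the standard way to compare the methods' running times. The pattern is shown on a placeholder expression (substitute the rpart, ctree, and evtree calls from the tutorial):

```r
# Pattern for timing a model-fitting call with system.time().
elapsed <- system.time({
  Sys.sleep(0.2)  # stand-in for a model-fitting call such as evtree(...)
})["elapsed"]
elapsed             # wall-clock seconds for the expression
```

Expect evtree's global evolutionary search to take noticeably longer than the greedy local searches performed by rpart and ctree.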
Consider running this command to make multiple trees!
Are the trees any different?
Create a visualization for each of the trees
Make sure that you have a separate X11 window!