New R course on Coursera: Data Analysis and Statistical Inference

Yesterday (Monday, 1 September), a new session of Data Analysis and Statistical Inference, taught by Dr. Mine Çetinkaya-Rundel of Duke University, started on Coursera. Just like in the previous run, all labs take place in DataCamp’s interactive learning environment.

Data Analysis and Statistical Inference will teach you how to make use of data in the face of uncertainty. Throughout the course, you will learn how to collect, analyze, and use data to make inferences and draw conclusions about real-world phenomena. No formal background is required, but mathematical skills are definitely a plus.

The course makes intensive use of R for its statistical computing; the corresponding interactive exercises, available on DataCamp, were developed in close collaboration with Dr. Çetinkaya-Rundel. In sum, this course is perfectly tailored to your needs if you are a beginning data scientist looking to expand your basic statistical knowledge.

We hope to welcome you to our online classroom soon.

P.S. If you prefer to complete the course at your own pace, we recommend having a look at the open intro course. There you can find similar material that is also supplemented with DataCamp’s interactive exercises.

Coursera course on computational finance with R

Today (Tuesday, 26 August), a new session of Professor Eric Zivot’s course on computational finance and financial econometrics starts on Coursera. Just like in the previous run of the course, most R labs and R assignments will take place in DataCamp’s interactive learning environment.

Designed by Professor Eric Zivot (University of Washington), Introduction to Computational Finance focuses on the mathematical and statistical tools and techniques used in quantitative and computational finance. With the help of real-life examples, you will be introduced to the dos and don’ts of financial data analysis, the estimation of statistical models and the construction of optimized portfolios. The course requires no formal background, but some basic mathematical skills will definitely come in handy.

DataCamp’s interactive R exercises are developed in close collaboration with Professor Zivot himself. They therefore meet the same high standards as academic courses, but are presented in DataCamp’s fun, learning-by-doing environment. All students who enroll in the course on Coursera will be directed to DataCamp to practice their skills and to complete assignments.

If you have always wanted to learn more about computational finance, or if you are just interested in doing financial econometrics with R, this course is a must-do for sure. We hope to welcome you to our online classroom soon!

P.S. If you prefer to do only the interactive exercises, the course is also available on DataCamp as a stand-alone version, which does require prior knowledge of finance and R.

Package rankings, task view rankings and much more: the Rdocumentation poster

Especially for useR! 2014, we created a poster about Rdocumentation.org, our R documentation aggregator that lets you search packages from CRAN, Bioconductor and GitHub. We received a lot of positive and valuable feedback on it, so we decided to share it again via our blog.

[Image: the Rdocumentation poster]

The Rdocumentation poster answers questions such as:

  • What are the 10 most downloaded R packages of all time?
  • Who maintains the most packages?
  • What are the most popular task views, and how did this ranking change over time?
  • How popular is GitHub in the R community?

So if you did not make it to the useR! 2014 conference, or could not attend the Wednesday evening poster session, here is another chance to have a detailed look at the poster. Feedback and questions can be sent to team@datacamp.com.

Feel free to share it with your network!

About Rdocumentation.org: Rdocumentation is a web application that helps you easily find and browse the documentation of R packages on CRAN, Bioconductor and GitHub. It lets you instantly search for functions and use advanced search on the documentation of all R packages.

Who wants to disrupt R training and R education?

Using EdTech applications as an academic, R trainer or training company is no longer a unique selling proposition, but a must-have commodity. In this post, we introduce the new DataCamp course creation tools for academics, trainers and enterprises, and make a call to those who are interested in using these tools. More information via team@datacamp.com.

Everyone who is involved in academic teaching or professional training is experiencing how a new wave of online education tools is changing the way things are done inside and outside the classroom. Using EdTech is no longer a differentiator, but a must-have. One must dare to go beyond the standard offering of webinars and the traditional collection of online instruction videos. All these new tools are disrupting the current business models around training, as well as the educational and pedagogic models used today.

New Features

For over a year now, DataCamp has been working on building tailored and scalable EdTech tools for R education and training. We develop these tools for our own R courses and tutorials (in 2014 alone we have already trained over 42,000 new R enthusiasts) and for academics, trainers and companies using R in a narrow or a broad sense. In the past months, we have been working hard on some serious improvements to our course creation tools, and today we make these new developments available to the public. You can now:

  • Integrate video material and add slides,
  • Write better and more complex submission correctness tests (more on that in a follow-up blogpost soon),
  • And, upon request, set up private learning environments for your students and clients and track their individual performance. (We’ll make sure to cover the technicalities of these new features in more detail in one of our next posts.)

Furthermore, we have created a whole new FAQ section that answers questions on the course creation process itself, how to track student and employee performance, ways to use DataCamp for your courses, books and training, our offer for tailored online and live training, and much more. Additionally, to help future course creators, we have developed our own course creation style guide based on the style guide published by Hadley Wickham.

For academics, trainers and enterprises

With these new functionalities and tools we want to meet the new needs and challenges of:

  • academics who want to complement their academic lectures with new and exciting interactive exercises,
  • trainers and training companies that need to adapt their online and live course portfolio to the changing business environment,
  • package authors and R enthusiasts who want to create their own online course on their favourite topic or package,
  • and (large) enterprises in need of cost-effective yet tailored and scalable training.

If you are an academic who is interested in using DataCamp, a professional trainer or training company that is ready to integrate EdTech into your course portfolio, or a company in need of high-quality, cost-effective R training, contact us at team@datacamp.com or go to our teaching site.

Including GitHub and Bioconductor on Rdocumentation: Technical Details

In our last blog post we announced the addition of GitHub and Bioconductor R packages to Rdocumentation. For the more technical amongst you, I’ll give a short, high-level description of what’s under the hood of Rdocumentation. Along with that, I’ll zoom in on some of the challenges that I encountered while adding GitHub and Bioconductor repositories.

Rdocumentation in a (technical) nutshell
The Rdocumentation web server communicates with an R server that’s running in the background. Triggered by a cron job, this R server executes the following steps on a daily basis:

  1. Check for all available packages and their version numbers using available.packages().
  2. Compare these with the ones on Rdocumentation.
  3. Install/update the ones that are out of sync.
  4. Generate the documentation for the newly installed/updated packages and store it in a zip file.

The Rdocumentation web server then picks up the newly generated documentation from the R server, parses it, and stores it in its database.
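To make steps 1 to 3 a bit more concrete, here is a minimal sketch of the comparison in R. It is an illustration, not the code that runs on our server: `out_of_sync()` and the `documented` vector (package name mapped to the version currently on Rdocumentation) are hypothetical names. The only thing it relies on is that available.packages() returns a matrix with package names as row names and a `Version` column.

```r
# Sketch only: decide which packages need to be installed or updated.
# `available`  : matrix as returned by available.packages()
# `documented` : hypothetical named character vector, package name -> version
#                already documented (the real storage format is not shown here)
out_of_sync <- function(available, documented) {
  pkgs   <- rownames(available)
  latest <- unname(available[, "Version"])
  known  <- unname(documented[pkgs])   # NA for packages never documented
  stale  <- is.na(known)               # new packages are always out of sync
  seen   <- !stale
  # Compare as version numbers, so that "1.10.0" counts as newer than "1.9.0"
  stale[seen] <- package_version(latest[seen]) > package_version(known[seen])
  pkgs[stale]
}

# Possible usage (needs network access, so it is left commented out):
# todo <- out_of_sync(available.packages(), documented)
```

Comparing with package_version() rather than as plain strings matters here, because version strings do not sort correctly as text.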

This setup effectively creates a fully automated documentation service. However, installing all R packages on a single machine is by no means a trivial task. Many packages depend on certain (often Linux-specific) libraries, such as C++ header files from various development packages. When these dependencies are missing, installation fails and manual intervention is required. We hope to get to the point where we can run a setup procedure on a server to prepare it for installation of all R packages, but for now this is a work in progress. Another problem is that when R updates, many packages break on installation. We’ve opted to ignore this for now, and not to update packages that don’t install on R’s latest version.

Adding GitHub and Bioconductor repositories
The first version of Rdocumentation only included the packages available on CRAN. Our latest update expanded the package portfolio with the available packages on Bioconductor and GitHub.

Implementing Bioconductor packages was very similar to implementing CRAN packages, but with a few caveats. The biggest one was that Bioconductor packages sometimes download massive datasets (> 1GB) upon installation, which makes installing and updating very costly in both time and storage space. To overcome this, we used the `parallel` package to run package installations in separate processes that were killed (with a SIGKILL signal to the process) if they didn’t terminate after some time. This way we avoided cluttering our machine, and the performance gain is worth the few packages we lose with this technique.
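The kill-after-timeout idea can be sketched as follows. This is an assumption about how such a wrapper might look, not our production code: `finished_in_time()` is a hypothetical helper, and it relies on process forking via mcparallel(), so it only works on Unix-like systems.

```r
library(parallel)  # mcparallel(), mccollect()
library(tools)     # pskill(), SIGKILL

# Sketch only: run `task` (a function without arguments) in a forked child
# process and kill that process if it has not delivered a result within
# `timeout` seconds. Returns TRUE if the task finished in time, FALSE otherwise.
finished_in_time <- function(task, timeout) {
  job <- mcparallel(task())
  # Wait at most `timeout` seconds; NULL means no result arrived in time
  res <- mccollect(job, wait = FALSE, timeout = timeout)
  pskill(job$pid, SIGKILL)       # has no effect if the child already exited
  mccollect(job, wait = FALSE)   # clean up the child process
  !is.null(res)
}

# Possible usage with a hypothetical package name and a 10 minute limit:
# finished_in_time(function() install.packages("somePackage"), timeout = 600)
```

A package whose installation gets killed this way is simply skipped, which is the trade-off described above.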

Adding GitHub support was very different. Credit goes to Hadley Wickham’s r-on-github script, which uses the GitHub API to search for all R repositories and their details (owner, stars, latest update, etc.). We only made some minor changes to his script to filter repositories on the number of stars they have, in order to cut out the many test repositories. The following graph plots the number of R repositories against the number of stars they have.

[Figure: number of R repositories on GitHub by number of stars]

We decided that 3 or more stars was an acceptable threshold for calling a repository “popular enough” for Rdocumentation. It is an arbitrary measure, but given the numbers shown in the graph above, even requiring 1 or more stars already discards the vast majority of repositories. Once the repository information is collected, install_github() from devtools is used to install all of the packages on the server. After the initial install of all packages, only packages that were created or updated on GitHub within the last week are considered, for obvious performance reasons.
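As a rough illustration of this selection step, the sketch below filters a table of repositories on stars and, optionally, on the date of the last update. The helper `select_repos()` and the column names `full_name`, `stars` and `updated` are assumptions made for this example; the actual output of the r-on-github script may look different.

```r
# Sketch only: keep repositories that are "popular enough" and, for the
# daily runs after the initial install, recently created or updated.
# `repos` is a hypothetical data frame with columns full_name, stars, updated.
select_repos <- function(repos, min_stars = 3, updated_since = NULL) {
  keep <- repos$stars >= min_stars
  if (!is.null(updated_since)) {
    keep <- keep & as.Date(repos$updated) >= as.Date(updated_since)
  }
  repos[keep, , drop = FALSE]
}

# Possible usage (left commented out, as it needs devtools and network access):
# recent <- select_repos(repos, updated_since = Sys.Date() - 7)
# for (repo in recent$full_name) devtools::install_github(repo)
```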

Any questions/remarks? Drop me a line at bram@datacamp.com

New! Search GitHub and Bioconductor packages on Rdocumentation

As of today, you can search Rdocumentation not only for CRAN packages, but also for the R packages available on GitHub and Bioconductor. This is our largest update yet, and brings us one step closer to creating a central place for all R documentation-related info and questions.

Today, the rise of alternatives to the CRAN package management system (Bioconductor, GitHub, …) can make finding and installing packages tedious. Documentation for CRAN packages is on cran.r-project.org, documentation for Bioconductor packages is on bioconductor.org, etc.

In addition, package developers currently tend not to release their packages on CRAN (immediately), but to make use of GitHub instead. Just think of well-known packages such as ggvis and slidify that are only available on GitHub. While the most popular GitHub R packages are passed on by word of mouth, many good packages that are not on CRAN remain relatively unknown. Wouldn’t it be nice to have an overview of all these packages that remain obscure in their GitHub repositories?

With this latest update, we aim to address these problems. Rdocumentation now supports automatic adding and updating of packages from both Bioconductor and GitHub, making us effectively the first R documentation aggregator that combines these sources into one searchable website.

If you have other suggestions, feel free to contact us via info@datacamp.com.

Who wants to learn R? Sharing DataCamp’s user stats and insights.

When one builds an online education start-up for R, the number one criterion to meet is the following: identify an increasing interest in learning R online. Once this box is checked, it is time to start thinking of the second most important criterion: establish a teaching approach that makes people so excited that they keep coming back to learn more, thereby turning them, slowly but surely, into black-belt R masters.

In order to investigate how DataCamp is performing on both criteria, we decided to analyze our user data for February in more detail, and to open up and share the results via this (comprehensive) Slidify presentation. We put some effort into the visualizations as well, so all results are prettified via rMaps, rCharts and googleVis. (For the curious souls among us, the presentation also gives a unique view on the status of DataCamp back then.)

For DataCamp, February has been one of the most interesting months so far in terms of user data, as we added two new and free online interactive courses to our curriculum: Data Analysis and Statistical Inference and Introduction to Computational Finance. Both are also used as interactive R complements to the Coursera courses of the same name. In February we welcomed over 14,000 new R enthusiasts, from a total of 163 countries. Our servers handled peak traffic of 1,000 requests per minute, and hundreds of concurrent users. Other insights that you will find in the presentation are:

  • Number of chapters started and finished by course
  • Geographical distribution of the DataCamp user base
  • Spillover effect across courses

Make sure to have a look, and if you want more information, send your requests to info@datacamp.com.

Decimal comma or decimal point? A googleVis visualization

As you all know, the decimal mark is a symbol used to separate the integer part from the fractional part of a number written in decimal form. Since I was born and raised in Continental Europe, I am quite fond of using the comma sign to indicate a decimal point. I’ve grown up with it, encountered this comma in both my literary and numerical escapades, and still shudder when thinking of its dual role in long divisions.

However, using the comma sign as a decimal mark has two consequences:

  • Since the comma “,” sign is already taken to mark the radix point, I’m obligated to use the dot “.” sign to separate the thousands.
  • As the comma “,” sign is only used by 24% of the world’s population, 76% of the people in the world are creating documents, writing texts, working in spreadsheets, etc. that contain numbers with a fractional part that looks different from mine.

While the first consequence is the inevitable sacrifice one must make for having the privilege to use the comma for fractional numbers, the second is a much harder obstacle to deal with in the real world of professional number crunchers. More often than I would like, I find myself struggling with sheets and documents that use the dot “.” sign to mark the radix point instead of my beloved comma. Why so often? Based on Wikipedia, roughly 60% of the world’s population uses the dot “.” sign to mark the radix point. And when looking for these numbers, I learnt that there are even more dissidents. In the Arab world they use the Arabic decimal separator for Eastern Arabic numerals, in Persian the decimal mark is called momayyez, and in English Braille the decimal mark even has its own sign…

Since politely asking these people to change their disruptive behavior will most likely be hopeless, and finger pointing is only a prerogative of the majority, I was forced to make myself a little tool to guide me through what at first looked like an insurmountable problem. Using the data I found on Wikipedia and with the help of the googleVis package, I created a world map that indicates the decimal separator that is used in each country. That way, depending on the origin of the sheet I receive, I always know what decimal mark to expect.
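For those who want to try something similar, here is a minimal sketch of such a map with gvisGeoChart() from googleVis. It is not the original code: the four-country data frame is just a tiny illustrative sample, and the column names `Country`, `Mark` and `Code` are my own choice for this example.

```r
# Sketch only: a tiny sample of countries and their decimal mark
marks <- data.frame(
  Country = c("Belgium", "Germany", "United States", "India"),
  Mark    = c("comma", "comma", "dot", "dot"),
  stringsAsFactors = FALSE
)

# gvisGeoChart() colors regions by a numeric column: 1 = comma, 0 = dot
marks$Code <- as.numeric(marks$Mark == "comma")

# Only build the chart when googleVis is installed
if (requireNamespace("googleVis", quietly = TRUE)) {
  chart <- googleVis::gvisGeoChart(marks, locationvar = "Country",
                                   colorvar = "Code")
  # plot(chart)  # opens the map in your browser
}
```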

The map (full-size) is not yet complete (mainly in Africa there are some blank spots left). So for those that are aware of the prevailing decimal mark culture in these regions, just let me know in the comment section and I’ll make sure to update them in case of a comma, or to proselytize them otherwise ;-). You can find the original code here.

Statistical Language Wars: The Infograph

A feature that all programming communities have in common is the numerous debates about why their programming language of choice is better, more advanced, faster, holier, etc. In today’s data science community, these discussions seem omnipresent, with advocates of SAS, SPSS, R, Python, Julia, etc. battling and challenging each other on every online medium. (Side note: these ‘data driven’ debates are often a good example of how you can prove anything with statistics.)

While these debates are a good thing for the community and the programming language as a whole, they unfortunately also have a negative effect on those individuals who are just at the beginning of their data analytics career. Biased opinions on all sides of the table make it difficult for new data analysts to see the forest for the trees.

Especially for this new group of data analysts (and future debaters), as well as for everyone else who is interested in learning data science or an additional statistical language, we created the infograph ‘Statistical Language Wars’, which gives a basic comparison between SAS, R and SPSS to see how they stack up. The aim is to provide a clearer starting point.

Statistical language wars: SAS vs R vs SPSS
Source: blog.datacamp.com

We’ll make sure to regularly update this infograph based on the feedback you provide, and we will definitely consider creating some new infographs that focus more on other players such as Python and Julia.

Feel free to share!

Embed Code:


<a href="http://blog.datacamp.com/statistical-language-wars-the-infograph/" ><img src="http://datacamp.wpengine.com/wp-content/uploads/2014/05/infograph.png" alt="Statistical language wars: SAS vs R vs SPSS" /></a><br/>Source: <a href="http://blog.datacamp.com">blog.datacamp.com</a><br/>

MathJax binding in Angular.js

Introduction to MathJax

MathJax is an open source JavaScript framework that makes formatting mathematical formulas easy. We use MathJax for rendering the many statistical formulas in the description of the DataCamp exercises.
MathJax converts a string like:

\[ \left( \sum_{k=1}^n a_k b_k \right)^2
\leq \left( \sum_{k=1}^n a_k^2 \right)
\left( \sum_{k=1}^n b_k^2 \right) \]

to

[Image: the rendered MathJax formula]

Combining MathJax with Angular.js

At DataCamp, we use Angular.js as our JavaScript MVC framework to create the feeling of a “single-page application”. However, one of the downsides of using such an MVC framework is that it’s not always easy to couple it with other JavaScript frameworks such as MathJax.

Angular provides data binding between the model and the view (more info here). The goal is to create a so-called directive that automatically renders the MathJax syntax when the model changes and the view is updated.

Step 1: Include the MathJax source script

<script src="https://c328740.ssl.cf1.rackcdn.com/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML"></script>

Step 2: Create the Directive

This code will create a mathjax directive that watches for any change on the ng-model attribute. This means that if a controller changes the $scope.mjx property, MathJax will automatically rerender the changed element.

If you have any questions or suggestions, feel free to let me know at dieter@datacamp.com
