Including GitHub and Bioconductor on Rdocumentation: Technical Details

In our last blog post we announced the addition of GitHub and Bioconductor R packages to Rdocumentation. For the more technical among you, I’ll give a short, high-level description of what’s under the hood at Rdocumentation. Along with that, I’ll zoom in on some of the challenges we encountered while adding the GitHub and Bioconductor repositories.


Rdocumentation in a (technical) nutshell
The Rdocumentation web server communicates with an R server that’s running in the background. Using a cron job, this R server executes the following steps on a daily basis:

  1. Check for all available packages and their version numbers using available.packages().
  2. Compare these with the ones on Rdocumentation.
  3. Install/update the ones that are out of sync.
  4. Generate the documentation for the newly installed/updated packages and store it in a zip file.

The Rdocumentation web server then picks up the newly generated documentation from the R server, parses it, and stores it in its database.
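
The comparison in steps 2 and 3 can be sketched in a few lines of R. This is only an illustration under stated assumptions, not the code that runs on the server: `find_out_of_sync()` is a hypothetical helper, and the version numbers are made-up sample data so that the snippet runs without network access.

```r
# Hypothetical helper: given two named character vectors of versions
# (names are package names), return the packages that are new or outdated.
find_out_of_sync <- function(available, documented) {
  is_stale <- vapply(names(available), function(pkg) {
    if (!pkg %in% names(documented)) return(TRUE)  # not documented yet
    # compareVersion() returns 1 when its first argument is the later version
    utils::compareVersion(available[[pkg]], documented[[pkg]]) > 0
  }, logical(1))
  names(available)[is_stale]
}

# Step 1: versions offered by the repository. On a real server this could be
#   available <- available.packages()[, "Version"]
# Made-up sample data keeps the sketch runnable offline.
available  <- c(ggplot2 = "0.9.3.1", plyr = "1.8.1", knitr = "1.5")

# Step 2: versions already documented (assumed to come from the database).
documented <- c(ggplot2 = "0.9.3.1", plyr = "1.8")

stale <- find_out_of_sync(available, documented)
print(stale)

# Step 3 would then be something like: install.packages(stale)
```

Here `plyr` is flagged because a newer version is available and `knitr` because it has no documented version yet, while `ggplot2` is left alone.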

This setup effectively creates a fully automated documentation service. However, installing all R packages on a single machine is by no means a trivial task. Many packages depend on certain (often Linux-specific) system libraries, such as the C++ header files from various development packages. When these dependencies are missing, installation fails and manual intervention is required. We hope to get to the point where we can run a setup procedure on a server to prepare it for installation of all R packages, but for now this is a work in progress. Another problem is that when R updates, many packages break on installation. We’ve opted to ignore this for now, and to not update packages that don’t install on R’s latest version.

Adding GitHub and Bioconductor repositories
The first version of Rdocumentation only included the packages available on CRAN. Our latest update expanded the package portfolio with the available packages on Bioconductor and GitHub.

Implementing Bioconductor packages was very similar to implementing CRAN packages, but with a few caveats. The biggest one was that Bioconductor packages sometimes download massive datasets (> 1GB) upon installation, which makes installing and updating them very time-consuming and storage-hungry. To overcome this, we used the `parallel` package to run package installations in separate processes that were killed (with a SIGKILL signal) if they didn’t terminate after some time. This way we avoided cluttering our machine, and the few packages we lose with this technique are worth the performance gain.
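
The timeout-and-kill idea described above can be sketched with `mcparallel()` and `mccollect()` from the `parallel` package plus `tools::pskill()`. This is a minimal sketch under assumptions, not our production code: `run_with_timeout()` is a hypothetical helper, the timeouts are arbitrary, and forking with `mcparallel()` only works on Unix-like systems.

```r
library(parallel)

# Hypothetical helper: evaluate expr_fun() in a forked child process and
# kill that process with SIGKILL if no result arrives within `timeout` seconds.
run_with_timeout <- function(expr_fun, timeout) {
  job <- mcparallel(expr_fun())
  # With wait = FALSE, mccollect() waits at most `timeout` seconds and
  # returns NULL when no result is available yet.
  result <- mccollect(job, wait = FALSE, timeout = timeout)
  if (is.null(result)) {
    tools::pskill(job$pid, tools::SIGKILL)  # still running: kill the child
    return(NULL)
  }
  result[[1]]
}

# Stand-ins for package installations; on a real server expr_fun would call
# something like install.packages(pkg).
fast <- run_with_timeout(function() "installed", timeout = 5)
slow <- run_with_timeout(function() { Sys.sleep(60); "installed" }, timeout = 1)

print(fast)
print(is.null(slow))
```

A `NULL` return value marks a package that was given up on, which is the trade-off mentioned above: a few packages are lost, but a single slow installation can no longer block the whole run.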

Adding GitHub support was very different. Credits go to Hadley Wickham’s r-on-github script. His script uses the GitHub API to search for all R repositories and their details (owner, stars, latest update, etc.). We only made some minor changes to his script to filter repositories on the number of stars they have, in order to cut out the many test repositories. The following graph plots the number of R repositories against the number of stars they have.

[Figure: number of R repositories on GitHub by number of stars]

We decided that 3 or more stars was an acceptable threshold for deciding that a repository is “popular enough” for Rdocumentation. It is an arbitrary measure, but given the numbers shown in the graph above, it seems that even requiring 1 or more stars already discards the large majority of repositories. Once the repository information is collected, install_github() from devtools is used to install all of the packages on the server. After an initial install of all packages, only packages that have been updated or created on GitHub within the last week are considered, for obvious performance reasons.
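
The two filters (at least 3 stars, and updated within the last week) could look roughly like this. The data frame below is made-up sample data standing in for the repository details collected through the GitHub API, and its column names are assumptions for the sake of the example.

```r
# Made-up repository metadata; column names are illustrative only.
repos <- data.frame(
  full_name  = c("alice/pkgA", "bob/test-repo", "carol/pkgC", "dave/pkgD"),
  stars      = c(12, 0, 3, 45),
  updated_at = as.Date(c("2014-05-10", "2014-05-11", "2014-03-01", "2014-05-08")),
  stringsAsFactors = FALSE
)

min_stars <- 3                      # the threshold mentioned in the post
today     <- as.Date("2014-05-12")  # fixed date so the example is reproducible

# Initial run: every repository that is "popular enough".
popular <- repos[repos$stars >= min_stars, ]

# Later runs: only popular repositories touched within the last 7 days.
recent <- popular[popular$updated_at >= today - 7, ]

print(popular$full_name)
print(recent$full_name)

# Each selected repository would then be installed with something like
#   devtools::install_github(full_name)
```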

Any questions/remarks? Drop me a line at bram@datacamp.com


New! Search GitHub and Bioconductor packages on Rdocumentation

As of today, you can use Rdocumentation to search not only CRAN packages, but also the R packages available on GitHub and Bioconductor. This is our largest update yet, and brings us one step closer to creating a central place for all R documentation related info and questions.

Today, the rise of alternatives to the CRAN package management system (Bioconductor, GitHub, …) can make finding and installing packages tedious. Documentation for CRAN packages is on cran.r-project.org, documentation for Bioconductor packages is on bioconductor.org, etc.

In addition, package developers increasingly tend not to release their packages on CRAN (immediately), but to make use of GitHub instead. Just think of well-known packages such as ggvis and slidify that are only available on GitHub. While the most popular GitHub R packages are passed on by word of mouth, many good packages that are not on CRAN remain relatively unknown. Wouldn’t it be nice to have an overview of all these packages that remain obscure in their GitHub repositories?


With this latest update, we aim to address these problems. Rdocumentation now supports automatic adding and updating of packages from both Bioconductor and GitHub, making us effectively the first R documentation aggregator that combines these sources into one searchable website.

If you have other suggestions, feel free to contact us via info@datacamp.com.


Who wants to learn R? Sharing DataCamp’s user stats and insights.

When building an online education start-up for R, the number one criterion to meet is the following: identify an increasing interest in learning R online. Once this box is checked, it is time to start thinking of the second most important criterion: establish a teaching approach that makes people so excited that they keep coming back to learn more, thereby turning them, slowly but surely, into black-belt R masters.

In order to investigate how DataCamp is performing on both criteria, we decided to analyze our user data for February in more detail, and to open up and share the results via this (comprehensive) Slidify presentation. We put some effort into the visualizations as well, so all results are prettified via rMaps, rCharts and googleVis. (For the curious souls among us, the presentation also gives a unique view on the status of DataCamp back then.)


For DataCamp, February is one of the most interesting months so far in terms of user data, as we added two new and free online interactive courses to our curriculum: Data Analysis and Statistical Inference and Introduction to Computational Finance. Both are (or were) also used as interactive R complements to the like-named Coursera courses. In February we welcomed over 14,000 new R enthusiasts, from a total of 163 countries. Our servers handled peak traffic of 1,000 requests per minute, and hundreds of concurrent users. Other insights that you will find in the presentation are:

  • Number of chapters started and finished by course
  • Geographical distribution of the DataCamp user base
  • Spillover effect across courses

Make sure to have a look, and if you want more information send your requests to info@datacamp.com.


Decimal comma or decimal point? A googleVis visualization

As you all know, the decimal mark is a symbol used to separate the integer part from the fractional part of a number written in decimal form. Since I was born and raised in Continental Europe, I am quite fond of using the comma sign as a decimal mark. I grew up with it, encountered this comma in both my literary and numerical escapades, and still shudder when thinking of its dual role in long divisions.

However, using the comma sign as a decimal mark has two consequences:

  • Since the comma “,” sign is already taken to mark the radix point, I’m obligated to use the dot “.” sign to separate the thousands.
  • As the comma “,” sign is only used by 24% of the world’s population, 76% of the people in the world are creating documents, writing texts, working in spreadsheets, etc. that contain numbers with a fractional part that looks different from mine.
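
In R, both conventions can be produced with formatC() through its big.mark and decimal.mark arguments. A minimal sketch follows; the number is made up, and the gsub() round trip is just one simple way to read a comma-formatted string back, not a robust parser.

```r
x <- 1234567.891

# Continental European style: dot for thousands, comma as decimal mark
comma_style <- formatC(x, format = "f", digits = 2,
                       big.mark = ".", decimal.mark = ",")

# Point style: comma for thousands, dot as decimal mark
point_style <- formatC(x, format = "f", digits = 2,
                       big.mark = ",", decimal.mark = ".")

print(comma_style)
print(point_style)

# Reading a comma-formatted string back into a number:
# drop the thousands dots first, then turn the decimal comma into a dot.
without_thousands <- gsub(".", "", comma_style, fixed = TRUE)
parsed <- as.numeric(gsub(",", ".", without_thousands, fixed = TRUE))
print(parsed)
```

For whole files, read.csv2() and the dec argument of read.table() cover the same ground.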

While the first consequence is the inevitable sacrifice one must make for having the privilege to use the comma for fractional numbers, the second is a much harder obstacle to deal with in the real world of professional number crunchers. More often than necessary, I find myself struggling with sheets and documents that use the dot “.” sign to mark the radix point instead of my beloved comma. Why so often? According to Wikipedia, roughly 60% of the world’s population uses the dot “.” sign to mark the radix point. And when looking for these numbers, I learnt there are even more dissidents. In the Arab world they use the Arabic decimal separator for Eastern Arabic numerals, in Persian the decimal mark is called momayyez, and in English Braille the decimal mark even has its own sign…

Since politely asking these people to change their disruptive behavior will most likely be hopeless, and finger pointing is only a prerogative of the majority, I was forced to make myself a little tool to guide me through what at first looked like an insurmountable problem. Using the data I found on Wikipedia, I created, with the help of the googleVis package, a world map indicating the decimal separator used in each country. That way, depending on the origin of the sheet I receive, I always know what decimal mark to expect.

The map (full-size) is not yet complete (mainly in Africa there are some blank spots left). So for those that are aware of the prevailing decimal mark culture in these regions, just let me know in the comment section and I’ll make sure to update them in case of a comma, or to proselytize them otherwise ;-). You can find the original code here.


Statistical Language Wars: The Infograph

A feature all programming communities have in common is the numerous debates about why their programming language of choice is better, more advanced, faster, holier etc. In today’s data science community, it seems like these discussions are omnipresent with advocates of SAS, SPSS, R, Python, Julia, etc. battling and challenging each other on every online medium. (side note: These ‘data driven’ debates are often a good example of how you can prove anything with statistics.)

While these debates are a good thing for the community and the programming language as a whole, they unfortunately also have a negative effect on those individuals just at the beginning of their data analytics career. Biased opinions on all sides of the table make it difficult for new data analysts to see the forest for the trees.

Especially for this new group of data analysts (and future debaters), as well as for everyone else interested in learning data science or an additional statistical language, we created the infograph ‘Statistical Language Wars’, which gives a basic comparison between SAS, R and SPSS to see how they stack up. The goal is to provide a clearer starting point.

Statistical language wars: SAS vs R vs SPSS
Source: blog.datacamp.com

We’ll make sure to regularly update this infograph based on the feedback you provide, and will consider creating some new infographs that focus more on other players such as Python and Julia.

Feel free to share!

Embed Code:


<a href="http://blog.datacamp.com/statistical-language-wars-the-infograph/" ><img src="http://datacamp.wpengine.com/wp-content/uploads/2014/05/infograph.png" alt="Statistical language wars: SAS vs R vs SPSS" /></a><br/>Source: <a href="http://blog.datacamp.com">blog.datacamp.com</a><br/>

MathJax binding in Angular.js

Introduction to MathJax

MathJax is an open source JavaScript framework that makes formatting mathematical formulas easy. We use MathJax for rendering the many statistical formulas in the description of the DataCamp exercises.
MathJax converts a string like:

\[ \left( \sum_{k=1}^n a_k b_k \right)^2
\leq \left( \sum_{k=1}^n a_k^2 \right)
\left( \sum_{k=1}^n b_k^2 \right) \]

to

[Image: the rendered MathJax formula]

Combining MathJax with Angular.js

At DataCamp we use Angular.js as our JavaScript MVC framework to create the feeling of a “single-page application”. However, one of the downsides of using such an MVC framework is that it’s not always easy to couple it with other JavaScript frameworks such as MathJax.

Angular provides data binding between the model and the view (more info here). The goal is to create something called a directive that automatically renders the MathJax syntax when the model changes and the view is updated.

Step 1 : Include the MathJax source script

<script src="https://c328740.ssl.cf1.rackcdn.com/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML"></script>

Step 2 : Create the Directive

This code will create a mathjax directive that watches for any change on the ng-model attribute. This means that if a controller changes the $scope.mjx property, MathJax will automatically rerender the changed element.

If you have any questions or suggestions, feel free to let me know at dieter@datacamp.com


The new look of learning R

Now that we have reached the milestone of 30,000 (!) enthusiastic R students, and the DataCamp platform is paving its way into academics and professional organizations, we felt it was time to take our design to a higher level. So as of this week, your favourite free learning platform for R tutorials and data science will have a totally new look and feel!

You will immediately notice a cleaner, lighter design that is centred around the individual student and his or her progress. Once logged in, you can now easily navigate between 60 hours of free interactive R content, be updated on our latest R learning materials, and continue your own study path with just one click. Under the hood we still run good ol’ R in the cloud, supported by a portfolio of other open-source software like AngularJS and NodeJS.


Some of the main improvements are:

  • A completely redesigned learning environment. Now a student can easily navigate between exercises and chapters within the same view. So (s)he can finish a complete course without leaving the learning environment.
  • A more prominent place for the different gamification elements, to spice up a student’s journey with rewards and social sharing.
  • A cleaner interface to manage and create your own interactive R courses. The updated documentation to create your own courses is now on GitHub.

This new design is only part of a bigger movement. Over the upcoming weeks and months we will release more R candy such as short screencasts, an updated version of our submission correctness tests, and of course new interactive R courses.

We hope you will enjoy it!


Free interactive R exercises on OpenIntro

DataCamp first to offer free interactive R tutorials via the OpenIntro platform.

In one week, the ten-week Coursera course on Data Analysis and Statistical Inference by prof. Mine Çetinkaya-Rundel of Duke University comes to an end. At DataCamp it was one of our first experiences providing interactive R exercises on a large scale, and we’re proud to say this journey is coming to a successful end. (We’ll write a more detailed post on this in the near future.)

For those of you who were not able to follow the course in the Coursera format, but still want to do the interactive exercises on DataCamp, there is now a great alternative: OpenIntro statistics. OpenIntro statistics is part of the OpenIntro project, and covers a wide range of educational materials on statistics such as videos, textbooks, and as of now interactive exercises by DataCamp. If you’re a student looking for a great introductory statistics course, or a teacher in need of a fully fledged teaching material package, the OpenIntro project is the place to go! (The OpenIntro project is an organisation focused on developing free and affordable education materials. OpenIntro statistics is their first project.)


All DataCamp R tutorials can be found under the labs section of the OpenIntro website. Just like for the Coursera course, these interactive exercises serve best as complements to the statistical concepts covered in the free OpenIntro statistics textbook and corresponding videos. If you’re a teacher using OpenIntro in your class, and you want to use the DataCamp tutorials as well, you can always contact us at teach@datacamp.com if you need more detailed information.

We’re happy to have been offered the opportunity to add our interactive R tutorials to the high-quality OpenIntro curriculum. To us, it is yet another step towards increasing the understanding and adoption of R in the data science and statistics world.

We hope you will enjoy it!

The DataCamp Team

Note: if you prefer to take the course via Coursera, a new session of the course has been announced and will launch September 1st 2014. 


Get notified when R packages update

Today’s highly active R user base is developing, re-developing, and releasing R packages at a never-before-seen rate. While this is fantastic news for the R community as such, it inevitably also causes growing pains as mentioned before.

One of the often cited problems is the painful and time-consuming task to keep track of changes and version updates of packages and functions (see for example the paper of Jeroen Ooms in The R Journal). After all, nothing beats the fun of putting a lot of effort in a project or task, just to realize minutes after finishing the job that package xyz released its latest version. (To say nothing of the frightening but inevitable moment when loading in the new version, praying to God the fragile life of your precious code will be spared.)

A better way to deal with these package updates is to be informed automatically when changes are made to the packages you depend on. This is exactly what the brand new notification feature of Rdocumentation does. It gives you the option to subscribe to the R packages of your choice, and when one of these packages gets updated on CRAN, Rdocumentation automatically sends an email to inform you.

Getting updates on future package versions via Rdocumentation is simple. Navigate to the package of your choice (let’s say ggplot2 on Rdocumentation), provide your email address, and hit the green subscribe button. A message will pop up to confirm your subscription, and that’s it. This is also shown in the following screenshot:

[Screenshot: subscribing to ggplot2 updates on Rdocumentation]

Rdocumentation is a tool that enables you to easily find and browse the documentation of all current packages and functions on CRAN. It offers features such as advanced search, package popularity rankings, community forums, and package download statistics. Rdocumentation is supported by DataCamp, provider of free R tutorials.


April Fools’ Day: The 7 Funniest Data Cartoons

To give this year’s April Fools’ Day a more analytical touch, we decided last week to do a little poll on internet cartoons. We asked our friends and colleagues to select their favourite data-related cartoon on the web, and organized a voting session to construct a top 7 list. (You can always share your own favourites in the comments.)

We proudly present you the winners of the April Fools’ 2014 Data Cartoon awards:

Number One: The Cloud 


Number Two: A Study on Statistics


Number Three: Pacman Statistics


Number Four: Dilbert One


Number Five: Halloween Statistics


Number Six: Dilbert Two


Number Seven: XKCD Correlation


Disqualified from the competition, but still funny:

[Cartoon: big data]

 

 
