How to become a data scientist in 8 easy steps: the infographic

This post was written by the team behind DataCamp, the online interactive learning platform for data science.  

Ever since Harvard Business Review dubbed it the “sexiest job of the 21st century”, the role of data scientist has stirred the interest of the general public. Many people are intrigued by this job, mainly because the name has an interesting ring to it. But it is exactly the name that also raises a lot of questions: what is a data scientist, and what do data scientists do exactly? Those of us who devote our lives to data science are frequently confronted with questions like these.

The answers to these questions are rarely as straightforward as you would expect: a quick Google search for “How to become a data scientist” shows that the concept means different things to different people. Many articles suggest various tools, courses and applications for becoming a data scientist, and with good reason: the options are practically unlimited. But let’s face it, for someone who is not familiar with the field, this advice can seem like a jungle of information. What’s more, it can be demotivating: the descriptions are sometimes dauntingly long, and the many details often hit readers like an avalanche.

DataCamp’s Guide to Becoming a Data Scientist

With all this in mind, DataCamp decided to help those who can’t see the forest for the trees: we designed a step-by-step infographic that clearly outlines how you can become a data scientist in 8 easy steps. This visual guide is meant for anyone who is interested in learning data science, and for anyone who is already a data scientist but wants some additional resources to sharpen their skills. The infographic is called “Become a data scientist in 8 easy steps”. Have a look at it!

How to become a data scientist

If you are thinking about becoming a data scientist, do not be put off by the eight steps that are presented in the infographic. We would like to emphasize that becoming a data scientist takes time and personal investment, but the journey is anything but dull! And don’t forget, there are plenty of courses available to set you on the right track.

If you are already a data scientist, drop us a line if you can think of other steps that you have taken in your professional journey.

Feel free to share!

New dplyr course by RStudio and DataCamp! Learn data manipulation interactively

DataCamp just launched its latest interactive course: dplyr. This new course was developed in close collaboration with Garrett Grolemund, RStudio’s master instructor. By taking this course, you will be challenged one step at a time to master the essentials of transforming data sets quickly and intuitively with the dplyr package. Start the course here.

The dplyr package is an exciting new chapter in the mission to bring painless data manipulation to the crowd. It is an R package that provides you with a fast and intuitive way to transform data sets with R. dplyr is the successor of plyr and is mainly authored by Hadley Wickham and Romain Francois. It is designed to be intuitive and easy to learn, thereby making “doing things” in R more user friendly.

It introduces five key functions for straightforward data manipulation: select, mutate, filter, arrange and summarize. Thanks to optimization in C++, these functions let you work extremely fast with larger data sets. These ‘dplyr verbs’ can be understood as atoms that combine into powerful molecular operations, which can handle around 90% of data manipulation tasks. As such, dplyr lets you, as a data scientist, accomplish more things, with more data, in less time. However, dplyr isn’t limited to these five functions: it also enables automated groupwise operations in R, provides a standard syntax for accessing and manipulating database data with R, and much more. All of this is covered and explained in this DataCamp course (check out the contents of the course).
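As a rough illustration of the five verbs, here is a minimal sketch on R’s built-in mtcars data set. The data set and the column choices are assumptions made for this example; they are not taken from the course.

```r
library(dplyr)

# mtcars ships with base R; the columns used here are illustrative only.
cars_subset   <- select(mtcars, mpg, cyl, wt)                 # keep three columns
cars_ratio    <- mutate(cars_subset, mpg_per_wt = mpg / wt)   # add a derived column
cars_four_cyl <- filter(cars_ratio, cyl == 4)                 # keep rows matching a condition
cars_sorted   <- arrange(cars_four_cyl, desc(mpg))            # sort rows, highest mpg first
cars_summary  <- summarize(cars_sorted,
                           mean_mpg = mean(mpg),              # collapse to one summary row
                           n_cars   = n())

print(cars_summary)
```

Each verb takes a data frame as its first argument and returns a new one, which is what makes the verbs easy to combine.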

To help you fully grasp the power and ease of use of dplyr, DataCamp has developed a brand new interactive course together with Garrett Grolemund. Garrett is a Data Scientist and Master Instructor at RStudio, holds a Ph.D. in Statistics, and specializes in teaching. He is the author of Hands-On Programming with R, as well as Data Science with R, an upcoming book from O’Reilly Media. He has taught people how to use R at over 50 government agencies, small businesses, and multi-billion dollar global companies.

A video of Garrett Grolemund explaining the dplyr package in the DataCamp course

In the course, you will learn how to use dplyr to perform basic data manipulation tasks with the five dplyr verbs, and how to combine them to solve challenging problems. You’ll also learn about groupwise operations using group_by(), about the pipe operator to chain your operations, and about the tbl structure, which provides a cleaner layout so you can better understand your data. Finally, you will learn how to use the dplyr syntax to access data stored in a database outside R.
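To make the grouping and chaining ideas concrete, here is a small sketch, again on the built-in mtcars data (an assumption of this example, not course material). It uses as_tibble(), which is how current dplyr releases expose the tbl structure.

```r
library(dplyr)

mpg_by_cyl <- mtcars %>%
  as_tibble() %>%                      # tbl structure: compact, readable printing
  group_by(cyl) %>%                    # later steps run once per cylinder count
  summarize(mean_mpg = mean(mpg),      # one row per group
            n_cars   = n()) %>%
  arrange(cyl)

print(mpg_by_cyl)
```

The pipe passes the result of each step as the first argument of the next, so the code reads top to bottom in the order the operations happen.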

The course is set up in DataCamp’s interactive learning platform, which aims to enhance your learning experience by allowing you to learn by doing. The course comprises 10 sections spread over five chapters; each section has an instructional video by Garrett, followed by an extensive set of interactive exercises. As such, the concepts that are introduced during the video lecture are directly tested through challenging assignments with tailored feedback to consolidate your knowledge step by step. You will effectively learn hands-on instead of losing time with suboptimal solutions like a four-hour screencast or webinar.

The DataCamp interactive learning environment

This is the first course of the RStudio DataCamp track, which will cover some of the company’s flagship products: dplyr, ggvis, rmarkdown, and the RStudio IDE. The other courses are scheduled to launch later this year.

So, if you want to learn more about the powerful dplyr package to solve challenging data analysis problems, head over to DataCamp and start right away!

DataCamp - Learning By Doing

New Course! A hands-on introduction to statistics with R by A. Conway (Princeton University)

The best way to learn is at your own pace. Combining the interactive R learning environment of DataCamp and the expertise of Prof. Conway of Princeton, we offer you an extensive online course on introductory statistics with R.  Start learning now…

Whether you are a professional using statistics in your job, an academic wanting a refresher on specific statistical topics, or a student taking statistics classes, this new DataCamp course will match your needs. It is a comprehensive and friendly course that requires no background knowledge in statistics or R. The aim is to provide you with a solid foundation for future learning and to help you put your own work into context. All this takes place in your browser thanks to the DataCamp online learning environment. Try it for free!

So, how does it all work? You can choose to subscribe to the course as a whole, or to take individual modules according to your own specific needs. The course consists of 7 modules, ranging from Student’s t-test through ANOVA to simple and multiple linear regression, and ending with a module on Moderation and Mediation. In total there are more than 250 interactive R exercises, which are accompanied by videos and slides. This adds up to 24 hours of material.

Try the DataCamp course co-developed by Prof. Conway

Interested? To give you the opportunity to get a taste of the course content and to try out the DataCamp learning experience, we offer the first module for free. Furthermore, if you are a student, you get a 75% discount on the whole course.

So what are you waiting for? Grab this learning opportunity and check out the course! Remember that the first module is free, that you can buy separate modules according to your needs, and if you buy all 7 modules at once, you get a significant discount.  On top of that, students can get a 75% reduction on the whole course.

On Professor Andrew Conway
Prof. Conway is a Senior Lecturer at Princeton and has been teaching undergraduate and graduate students for 20 years. His experience is reflected in the quality of this course. The content of this course was previously offered on Coursera, where more than 200,000 individuals followed it, making it the second most popular Coursera course using R. Psychology students at Princeton are already following the DataCamp course this semester.

On DataCamp
The course is set up in DataCamp’s interactive platform that aims to enhance the learning experience by offering a learning-by-doing approach. The material is presented by short videos and slides to explain major elements. In order to consolidate your learning, every section ends with interactive exercises that let you practice the covered concepts while giving you tailored feedback.

You will discover R’s capabilities and how they interplay with each other step by step. You can learn at your own pace, stopping to take a break or replay a segment at any time. The system tracks your progress so you can stop at any time; it will start up where you left off. This way, you will learn effectively instead of losing time with one-speed-fits-all solutions like a four-hour screencast or webinar.

Data analysis the data.table way: introducing DataCamp’s newest course

Together with the key people behind the data.table package, Matt Dowle and Arun Srinivasan,  DataCamp developed a brand new interactive course to bring your data analysis skillset up to date with the essentials of the powerful data.table package. Learn more… 

The popularity of the data.table package is increasing, and with good reason. Not only is the number of package downloads rising rapidly, but data.table is also the talk of the R town given the numerous presentations by Matt and Arun at conferences such as useR!2014, EARL, R/Insurance and R/Finance.

The data.table package allows you to reduce your programming time as well as your computing time considerably, and it is especially useful if you often find yourself working with large datasets. For example, to read in a 20GB .csv file with 200 million rows and 16 columns, data.table only needs 8 minutes thanks to the fread() function, instead of the hours it would take you with the read.csv() function. Once you understand its concepts and principles, the speed and simplicity of the package are astonishing!
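The 20GB benchmark cannot be reproduced in a few lines, but the calling convention can. Below is a small sketch that writes a throwaway CSV file to a temporary path (the file and its contents are made up for this example) and reads it back both ways.

```r
library(data.table)

# Hypothetical toy file, only there to have something to read.
csv_path <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:5, value = c(2.5, 3.1, 4.8, 1.2, 9.9)),
          csv_path, row.names = FALSE)

dt <- fread(csv_path)      # data.table's fast reader, returns a data.table
df <- read.csv(csv_path)   # base R reader, returns a data.frame

print(dt)
```

On a file this small the two calls are indistinguishable in speed; the difference described above only shows up on large files.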

However, to get the most out of data.table’s functionalities, you first have to overcome its learning curve: even though the syntax is not extremely difficult, it does take some practice to fully grasp it so that its built-in functionalities can make your life easier. This is exactly why DataCamp has made an interactive online course on the data.table package for R, in collaboration with the key people behind it, namely Matt Dowle, main author, and Arun Srinivasan, co-author and major contributor. This course, the only one of its kind, is called Data Analysis: the data.table way. It is designed to help you get started with the essentials of the data.table package. Among other things, you will learn about operations such as selection and grouping in DT[i, j, by], and intermediate topics like chaining, setting keys and the different join types.
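For readers who have not seen the DT[i, j, by] form before, the following sketch shows it on the built-in mtcars data, together with chaining, a key and a simple join. The data set, the lookup table and all variable names are assumptions of this example rather than course content.

```r
library(data.table)

DT <- as.data.table(mtcars)

# DT[i, j, by]: take the rows where i holds, compute j, grouped by `by`.
heavy_by_cyl <- DT[wt > 3, .(mean_mpg = mean(mpg), n_cars = .N), by = cyl]

# Chaining: the result of the first [ ] is fed straight into a second one.
ordered <- DT[, .(mean_mpg = mean(mpg)), by = cyl][order(cyl)]

# Keys and a join: setkey() sorts DT by cyl and marks cyl as its key.
setkey(DT, cyl)
labels <- data.table(cyl   = c(4, 6, 8),
                     label = c("four", "six", "eight"),
                     key   = "cyl")
joined <- labels[DT]   # for every row of DT, look up the matching label

print(ordered)
```

The same bracket syntax covers filtering, aggregation and joins, which is a large part of why the package takes some practice at first.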


The course is set up in DataCamp’s interactive learning platform that aims to enhance the learning experience by centering on learning-by-doing. The course is supplemented by short videos and slides to explain major elements.  You will discover the functionalities and how they interplay with each other step by step. This way, you will effectively learn hands on instead of losing time with suboptimal solutions like a four-hour screencast or webinar. What’s more, in order to consolidate your learning, every section ends with interactive exercises that let you practice the covered concepts while giving you tailored feedback.

So, if you are looking for a high-quality course that brings you up to speed with one of the hottest packages in R today, go to DataCamp and add the power of data.table to your data analysis skillset!

New R course on Coursera: Data Analysis and Statistical Inference

Yesterday (Monday 1st of September), a new session of Data Analysis and Statistical Inference, taught by Dr. Mine Çetinkaya-Rundel from Duke University, started on Coursera. Just like in the previous run, all labs take place in DataCamp’s interactive learning environment.

Data Analysis and Statistical Inference will teach you how to make use of data in the face of uncertainty. Throughout the course, you will learn how to collect, analyze, and use data to make inferences and conclusions about real world phenomena. No formal background is required, but mathematical skills are definitely a plus.


The course makes intensive use of R for its statistical computing; the corresponding interactive exercises, available on DataCamp, were developed in close collaboration with Dr. Çetinkaya-Rundel. In sum, this course is perfectly tailored to your needs if you are a beginning data scientist looking to expand your basic statistical knowledge.

We hope to welcome you to our online classroom soon.

P.S. In case you prefer to complete the course self-paced in your own time, we recommend you have a look at the open intro course. There you can find similar material that is also supplemented with the interactive exercises of DataCamp.

Coursera course on computational finance with R

As of today (Tuesday 26th of August), a new session of Professor Eric Zivot’s course on computational finance and financial econometrics starts on Coursera. Just like the previous run of the course, most R labs and R assignments will take place in DataCamp’s interactive learning environment.

Designed by Professor Eric Zivot (University of Washington), Introduction to computational finance focuses on mathematical and statistical tools and techniques that are used in quantitative and computational finance. With the help of real-life examples, you will be introduced to the dos and don’ts of financial data analysis, estimations of statistical models and the construction of optimized portfolios. The course requires no formal background, but some basic mathematical skills will definitely come in handy.


DataCamp’s interactive R exercises are developed in close collaboration with Professor Zivot himself. They therefore meet the same high quality standards as academic courses, but are presented in DataCamp’s fun, learning-by-doing environment. All students who choose to enroll for the course on Coursera will be directed to DataCamp to practice their skills and to complete assignments.

If you have always wanted to learn more about computational finance, or if you are just interested in doing financial econometrics with R, this course is a must-do for sure. We hope to welcome you to our online classroom soon!

PS. In case you prefer to only do the interactive exercises, the course is also available on DataCamp as a stand-alone version which does require prior knowledge about finance and R.

Package rankings, task view rankings and much more: the Rdocumentation poster

Especially for useR! 2014 we created a poster on Rdocumentation, our R documentation aggregator that lets you search packages from CRAN, Bioconductor and GitHub. We received a lot of positive and valuable feedback on it, so we decided to share it again via our blog.

The Rdocumentation poster answers questions such as:

  • What are the 10 most downloaded R packages of all time?
  • Who maintains the most packages?
  • What are the most popular task views, and how did this ranking change over time?
  • How popular is GitHub in the R community?

So if you did not make it to the useR! 2014 conference, or if you could not attend the Wednesday evening poster session, here is another chance to have a detailed look at the poster. Feedback and questions are welcome.

Feel free to share it with your network!

On Rdocumentation
Rdocumentation is a web application that helps you to easily find and browse the documentation of R packages on CRAN, Bioconductor and GitHub. It enables you to instantly search for functions and use advanced search on the documentation of all R packages.

Who wants to disrupt R training and R education?

Using EdTech applications as an academic, R trainer or training company is no longer a unique selling proposition, but a must-have commodity. In this post, we introduce the new DataCamp course creation tools for academics, trainers and enterprises, and make a call to those who are interested in using these tools. More information via

Everyone who is involved in academic teaching or professional training is experiencing how a new wave of online education tools is changing the way things are done inside and outside the classroom. Using EdTech is no longer a differentiator, but a must-have. One must dare to go beyond the standard offering of webinars and the traditional collection of online instruction videos. All these new tools are disrupting the current business models around training, as well as the educational and pedagogic models used today.

New Features

For over a year now, DataCamp has been building tailored and scalable EdTech tools for R education and training. We develop these for our own R courses and tutorials (in 2014 alone we have already trained over 42,000 new R enthusiasts) and for academics, trainers and companies using R in a narrow or a broad sense. In the past months, we have been working hard on some serious improvements to our course creation tools, and today we make these new developments available to the public. You can now:

  • Integrate video material and add slides,
  • Write better and more complex submission correctness tests (more on that in a follow-up blogpost soon),
  • And, upon request, set up private learning environments for your students and clients and track their individual performance. (We’ll make sure to cover the technicalities of these new features in more detail in one of our next posts.)

Furthermore, we have created a whole new FAQ section that answers questions on the course creation process itself, how to track student and employee performance, ways to use DataCamp for your courses, books and trainings,  our offer for tailored online and live trainings, and much more. Additionally, to help future course creators, we have developed our own course creation style guide based on the style guide published by Hadley Wickham.


For academics, trainers and enterprises

With these new functionalities and tools we want to meet the new needs and challenges of:

  • academics who want to complement their academic lectures with new and exciting interactive exercises,
  • trainers and training companies that need to adapt their online and live course portfolio to the changing business environment,
  • package authors and R enthusiasts who want to create their own online course on their favourite topic or package,
  • and (large) enterprises in need of cost-effective yet tailored and scalable training.

If you are an academic who is interested in using DataCamp, a professional trainer or training company that is ready to integrate EdTech into your course portfolio, or a company in need of high-quality, cost-effective R training, contact us or go to our teaching site.

Including GitHub and Bioconductor on Rdocumentation: Technical Details

In our last blog post we announced the addition of GitHub and Bioconductor R packages to Rdocumentation. For the more technical among you, I’ll give a short, high-level description of what’s under the hood of Rdocumentation. Along with that, I’ll zoom in on some of the challenges that I encountered while adding the GitHub and Bioconductor repositories.


Rdocumentation in a (technical) nutshell
In a nutshell, the Rdocumentation web server communicates with an R server that’s running in the background. Using a cron job, this R server executes the following steps on a daily basis:

  1. Check for all available packages and their version numbers using available.packages().
  2. Compare these with the ones on Rdocumentation.
  3. Install/update the ones that are out of sync.
  4. Generate the documentation for the newly installed/updated packages and store it in a zip file.

The Rdocumentation web server then picks up the newly generated documentation from the R server, parses it, and stores it in its database.
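The comparison in steps 1 to 3 can be sketched in base R. This is only an illustration under stated assumptions: the function name find_out_of_sync, the `documented` record and the package versions below are hypothetical and do not come from the actual Rdocumentation code.

```r
# `available` mimics the Package and Version columns of available.packages();
# `documented` is a hypothetical record of what the site already holds.
find_out_of_sync <- function(available, documented) {
  current <- documented$Version[match(available$Package, documented$Package)]
  is_new <- is.na(current)                       # not documented yet
  is_changed <- rep(FALSE, length(current))
  is_changed[!is_new] <-
    package_version(available$Version[!is_new]) !=
    package_version(current[!is_new])            # version differs from ours
  available$Package[is_new | is_changed]
}

# Made-up example data; in a real run `available` could be built with
# as.data.frame(available.packages()[, c("Package", "Version")]).
available <- data.frame(Package = c("dplyr", "data.table", "ggvis"),
                        Version = c("0.3.0", "1.9.4", "0.4"),
                        stringsAsFactors = FALSE)
documented <- data.frame(Package = c("dplyr", "data.table"),
                         Version = c("0.2", "1.9.4"),
                         stringsAsFactors = FALSE)

to_update <- find_out_of_sync(available, documented)
print(to_update)
```

The packages returned by such a check are the ones that would then be installed or updated and have their documentation regenerated.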

This setup effectively creates a fully automated documentation service. However, installing all R packages on a single machine is by no means a trivial task. Many packages depend on certain (often Linux-specific) libraries, such as C++ header files from various development packages. When these dependencies are missing, installation fails and manual intervention is required. We hope to get to the point where we can run a setup procedure on a server to prepare it for installation of all R packages, but for now this is a work in progress. Another problem is that when R updates, many packages break on installation. We’ve opted to ignore this for now, and not to update packages that don’t install on R’s latest version.

Adding GitHub and Bioconductor repositories
The first version of Rdocumentation only included the packages available on CRAN. Our latest update expanded the package portfolio with the available packages on Bioconductor and GitHub.

Implementing Bioconductor packages was very similar to implementing CRAN packages, with a few caveats. The biggest one was that Bioconductor packages sometimes download massive datasets (> 1GB) upon installation, which makes installing and updating very costly in both time and storage space. To overcome this, we used the `parallel` package to run package installations in threads that were killed (with a SIGKILL signal to the process) if they didn’t terminate after some time. This way we avoided cluttering our machine, and losing a few packages to this technique is worth the performance gain.
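The post does not show the code behind this. One way to get similar behaviour with the `parallel` package on a Unix-like system is to fork a child process with mcparallel(), poll it with mccollect(), and kill it with tools::pskill() when it overruns. The helper name run_with_timeout and the timeout values are hypothetical, and this is a sketch rather than the setup actually used.

```r
library(parallel)

# Run task() in a forked child; kill the child if it exceeds the time limit.
# Unix-only, because mcparallel() relies on forking.
# Note: a task that itself returns NULL cannot be told apart from a timeout.
run_with_timeout <- function(task, timeout_seconds) {
  job <- mcparallel(task())
  result <- mccollect(job, wait = FALSE, timeout = timeout_seconds)
  if (is.null(result)) {
    tools::pskill(job$pid, tools::SIGKILL)   # child overran: send SIGKILL
    suppressWarnings(mccollect(job))         # reap the killed child
    return(NULL)
  }
  result[[1]]
}

# Stand-ins for package installations; a real task might instead be
# function() install.packages("somePackage").
quick <- run_with_timeout(function() 1 + 1, timeout_seconds = 5)
slow  <- run_with_timeout(function() Sys.sleep(30), timeout_seconds = 1)
```

Here `quick` receives the value computed in the child, while `slow` comes back as NULL because its child was killed after the one-second limit.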

Adding GitHub support was very different. Credit goes to Hadley Wickham’s r-on-github script, which uses the GitHub API to search for all R repositories and their details (owner, stars, latest update, etc.). We made only minor changes to his script, filtering repositories on the number of stars they have in order to cut out the many test repositories. The following graph plots the number of R repositories against the number of stars that they have.


We decided that 3 or more stars was an acceptable threshold for calling a repository “popular enough” for Rdocumentation. It is an arbitrary measure, but given the counts shown in the graph above, even requiring 1 or more stars already discards the large majority of repositories. Once the repository information is collected, install_github() from devtools is used to install all of the packages on the server. After the initial install of all packages, only packages that have been updated or created on GitHub within the last week are considered, for obvious performance reasons.
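The two filters described here, a star threshold and a one-week recency window, amount to simple subsetting once the repository details are in a data frame. The repositories, dates and column names below are invented for illustration; the real details come from the GitHub API via the script mentioned above.

```r
# Hypothetical repository listing.
repos <- data.frame(
  full_name = c("user-a/pkg-one", "user-b/test-repo",
                "user-c/pkg-two", "user-d/pkg-three"),
  stars     = c(12, 0, 3, 45),
  updated   = as.Date(c("2014-07-01", "2014-07-02",
                        "2014-06-01", "2014-06-30")),
  stringsAsFactors = FALSE
)

min_stars <- 3                                   # threshold named in the post
popular <- repos[repos$stars >= min_stars, ]     # used for the initial install

today  <- as.Date("2014-07-03")                  # fixed date, for reproducibility
recent <- popular[popular$updated >= today - 7, ]  # used for later weekly updates

print(recent$full_name)
# Installation itself would then be along the lines of
# devtools::install_github(recent$full_name), which needs network access.
```

Keeping the threshold in a single variable makes it easy to revisit the "popular enough" cut-off later.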

Any questions/remarks? Drop me a line at

New! Search GitHub and Bioconductor packages on Rdocumentation

As of today, you can search Rdocumentation not only for CRAN packages, but also for the R packages available on GitHub and Bioconductor. This is our largest update yet, and it brings us one step closer to creating a central place for all R documentation related info and questions.

Today, the rise of alternatives to the CRAN package management system (Bioconductor, GitHub, …) can make finding and installing packages tedious. Documentation for CRAN packages lives on one site, documentation for Bioconductor packages on another, and so on.

In addition, package developers increasingly tend not to release their packages on CRAN (immediately), but to use GitHub instead. Just think of well-known packages such as ggvis and slidify that are only available on GitHub. While the most popular GitHub R packages are passed on by word of mouth, many good packages that are not on CRAN remain relatively unknown. Wouldn’t it be nice to have an overview of all these packages that remain obscure in their GitHub repositories?


With this latest update, we aim to address these problems. Rdocumentation now supports automatic adding and updating of packages from both Bioconductor and GitHub, making us effectively the first R documentation aggregator that combines these sources into one searchable website.

If you have other suggestions, feel free to contact us via
