At OrangeTree Global, our students often ask us whether they should use R or Python for their day-to-day data analysis tasks. Although we have generally focussed on R programming and on applying analytical techniques in R, we always answer that the choice depends on the type of data analysis challenge they are facing.
Both Python and R are popular programming languages for statistics. While R’s functionality is developed with statisticians in mind (think of R’s strong data visualization capabilities!), Python is often praised for its easy-to-understand syntax.
In this post, we will highlight some of the differences between R and Python, and how they both have a place in the data science and statistics world. If you prefer a visual representation, make sure to check out the corresponding infographic “Data Science Wars: R vs Python”.
Ross Ihaka and Robert Gentleman created the open-source language R in 1995 as an implementation of the S programming language. The purpose was to develop a language that focused on delivering a better and more user-friendly way to do data analysis, statistics and graphical models. At first, R was primarily used in academia and research, but lately the enterprise world has been discovering R as well. This makes R one of the fastest-growing statistical languages in the corporate world.
One of the main strengths of R is its huge community that provides support through mailing lists, user-contributed documentation and a very active Stack Overflow group. There is also CRAN, a huge repository of curated R packages to which users can easily contribute. These packages are a collection of R functions and data that make it easy to immediately get access to the latest techniques and functionalities without needing to develop everything from scratch yourself.
Finally, if you’re an experienced programmer, you probably won’t have a hard time getting up to speed with R. As a beginner, however, you might find yourself struggling with the steep learning curve. Luckily, there are many great learning resources you can consult nowadays.
Python was created by Guido van Rossum in 1991 and emphasizes productivity and code readability. Programmers who want to delve into data analysis or apply statistical techniques are some of the main users of Python for statistical purposes.
The closer you get to working in an engineering environment, the more likely it is that you will prefer Python. It’s a flexible language that is great for doing something novel, and given its focus on readability and simplicity, its learning curve is relatively low.
Similar to R, Python has packages as well. PyPI is the Python Package Index and consists of libraries to which users can contribute. Just like R, Python has a great community, but it is a bit more scattered, since it’s a general purpose language. Nevertheless, Python for data science is rapidly claiming a more dominant position in the Python universe: the expectations are growing and more innovative data science applications will see their origin here.
R and Python: The General Numbers
On the web, you can find many numbers comparing the adoption and popularity of R and Python. While these figures often give a good indication of how these two languages are evolving in the overall ecosystem of computer science, it’s hard to compare them side by side. The main reason for this is that you will find R only in a data science environment; as a general purpose language, Python, on the other hand, is widely used in many fields, such as web development. This often biases the ranking results in favor of Python, while the salaries are affected somewhat negatively.
[Figure: R vs Python numbers]
When and how to use R?
R is mainly used when the data analysis task requires standalone computing or analysis on individual servers. It’s great for exploratory work, and it’s handy for almost any type of data analysis because of the huge number of packages and readily usable tests that often provide you with the necessary tools to get up and running quickly. R can even be part of a big data solution.
When getting started with R, a good first step is to install the amazing RStudio IDE. Once this is done, we recommend that you have a look at the following popular packages:
- dplyr, plyr and data.table to easily manipulate data,
- stringr to manipulate strings,
- zoo to work with regular and irregular time series,
- ggvis, lattice, and ggplot2 to visualize data, and
- caret for machine learning.
When and how to use Python?
You can use Python when your data analysis tasks need to be integrated with web apps or if statistics code needs to be incorporated into a production database. Being a fully fledged programming language, it’s a great tool to implement algorithms for production use.
While the immaturity of Python packages for data analysis was an issue in the past, this has improved significantly over the years. Make sure to install NumPy/SciPy (scientific computing) and pandas (data manipulation) to make Python usable for data analysis. Also have a look at matplotlib to make graphics, and scikit-learn for machine learning.
Unlike R, Python has no clear “winning” IDE. We recommend having a look at Spyder, IPython Notebook and Rodeo to see which one best fits your needs.
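To give a feel for how NumPy and pandas divide the work, here is a minimal sketch. The numbers and column names are invented purely for illustration, and this is only one of many ways to use the two libraries together.

```python
import numpy as np
import pandas as pd

# Hypothetical measurements, invented purely for illustration.
heights_cm = np.array([158.0, 172.5, 181.0, 165.5])

# NumPy: vectorised arithmetic and summary statistics on arrays.
heights_m = heights_cm / 100
mean_height_m = heights_m.mean()

# pandas: labelled, tabular data built on top of NumPy arrays.
df = pd.DataFrame({
    "group": ["a", "b", "a", "b"],
    "height_cm": heights_cm,
})
mean_by_group = df.groupby("group")["height_cm"].mean()

print(round(mean_height_m, 4))
print(mean_by_group)
```

NumPy handles the raw numeric arrays, while pandas adds labels and grouping on top, which is roughly the role dplyr or data.table plays in R.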
R and Python: The Data Science Numbers
If you look at recent polls that focus on programming languages used for data analysis, R is often a clear winner. If you focus specifically on Python and R’s data analysis community, a similar pattern appears.
[Figure: R vs Python activity]
Despite the above figures, there are signals that more people are switching from R to Python. Furthermore, there is a growing group of individuals using a combination of both languages when appropriate. This is exactly in line with what we recommend to our students as well.
If you’re planning to start a career in data science, you are well served by either language. Job trends indicate an increasing demand for both skills, and wages are well above average.
R: Pros and Cons
Pro: A picture says more than a thousand words
Visualized data can often be understood more efficiently and effectively than the raw numbers alone. R and visualization are a perfect match. Some must-see visualization packages are ggplot2, ggvis, googleVis and rCharts.
Pro: R ecosystem
R has a rich ecosystem of cutting-edge packages and an active community. Packages are available at CRAN, Bioconductor and GitHub. You can search through all R packages at Rdocumentation.
Pro: R is the lingua franca of data science
R is developed by statisticians for statisticians. They can communicate ideas and concepts through R code and packages; you don’t necessarily need a computer science background to get started. Furthermore, it is increasingly adopted outside of academia.
Pro/Con: R is slow
R was developed to make the life of statisticians easier, not the life of your computer. Although R can be experienced as slow due to poorly written code, there are multiple projects that aim to improve R’s performance: pqR, Renjin, FastR, Riposte and many more.
Con: R has a steep learning curve
R’s learning curve is non-trivial, especially if you come from a GUI for your statistical analysis. Even finding packages can be time-consuming if you’re not familiar with the ecosystem.
Python: Pros and Cons
Pro: IPython Notebook
The IPython Notebook makes it easier to work with Python and data. You can easily share notebooks with colleagues, without them having to install anything. This drastically reduces the overhead of organizing code, output and notes files, and allows you to spend more time doing real work.
Pro: A general purpose language
Python is a general purpose language that is easy and intuitive. This gives it a relatively flat learning curve, and it increases the speed at which you can write a program. In short, you need less time to code and you have more time to play around with it!
Furthermore, Python ships with a built-in, low-barrier-to-entry testing framework that encourages good test coverage. This helps keep your code reusable and dependable.
Pro: A multi-purpose language
Python brings people with different backgrounds together. Because it is a common, easy-to-understand language that programmers already know and statisticians can easily learn, you can build a single tool that integrates with every part of your workflow.
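As an illustration of how little ceremony a test needs, here is a minimal sketch that assumes the built-in framework meant above is the standard-library unittest module. The mean function and the MeanTests class are hypothetical names made up for this example.

```python
import unittest


def mean(values):
    """Return the arithmetic mean of a non-empty sequence of numbers."""
    if not values:
        raise ValueError("mean() requires at least one value")
    return sum(values) / len(values)


class MeanTests(unittest.TestCase):
    def test_typical_values(self):
        self.assertEqual(mean([1, 2, 3, 4]), 2.5)

    def test_empty_input_is_rejected(self):
        with self.assertRaises(ValueError):
            mean([])


# Load and run the tests explicitly so the script keeps going afterwards.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(MeanTests)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

In day-to-day work you would more likely keep the tests in their own file and run them from the command line with python -m unittest.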
Con: Visualizations
Visualizations are an important criterion when choosing data analysis software. Although Python has some nice visualization libraries, such as Seaborn, Bokeh and Pygal, there are maybe too many options to choose from. Moreover, compared to R, visualizations are usually more convoluted, and the results are not always so pleasing to the eye.
Con: Python is a challenger
Python is a challenger to R. It does not offer an alternative to the hundreds of essential R packages. Although it’s catching up, it’s still unclear whether this will make people give up R.
And the winner is...
Up to you! As a data scientist, it’s your job to pick the language that best fits your needs. Some questions that can help you:
- What problems do you want to solve?
- What are the net costs for learning a language?
- What are the commonly used tools in your field?
- What are the other available tools and how do these relate to the commonly used tools?
Do not choose between R & Python, learn both
In general, you shouldn’t be choosing between R and Python, but instead should be working towards having both in your toolbox. Investing your time into acquiring working knowledge of the two languages is worthwhile and practical for multiple reasons.
It strengthens your data science communication skills
Both R and Python have strong online communities, such as R-bloggers and python.org, dedicated to the respective languages. Looking at these sites, you can get the impression that the R and Python communities are completely disjoint. Needless to say, that is not the case.
In the real world of data science, Python and R users intersect a lot. So whichever industry or discipline you are interested in, you are likely to run into projects done in both languages. To appreciate it all, you need to have at least a basic understanding of both R and Python. Furthermore, by mastering both, you gain the versatility to present and communicate effectively regardless of whether your audience is more comfortable with R or Python. So if you strive to become a data scientist, you will eventually need to be fairly familiar with both languages, and most likely a whole lot more.
It boosts your data science career
Knowing both R and Python will open doors to more job opportunities. Some companies, or departments within companies, might prefer Python, while others like to work with R. Imagine that you are a perfect fit for the job, except that you know R while the company requires you to know Python. Wouldn’t that suck? Generally, professionals from the industry encourage entrants to acquire as many tools and skills as they can. Most of the time you won’t be expected to be a complete master of R or Python, but displaying your commitment and passion by having learned at least some of both will only give you bonus points.
It is not that hard
You can think of Python and R as Spanish and Italian; they are both very different and very similar at the same time. They have a different syntax and have their own (technical) advantages, but at the same time they become very similar when appropriate Python packages are used (numpy, pandas, …). For example:
Suppose you want to load CSV files. In R you have a couple of options, one of which is read_csv(…). In Python you can use a function from the pandas library with the code pd.read_csv(…). Spot the difference!
Also, both Python and R are what are considered “scripting languages”, which allow you to write snippets of executable code without a separate compile step, unlike Java, for example. Next, they both have libraries and packages that you load into your environment to add functionality and do the tasks you need to complete. In addition, when working with both, you will find that the workflow is very similar, as are the documentation and communities surrounding them.
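To make the Python side concrete, here is a small sketch. The CSV content is invented and held in memory so the snippet runs without a file on disk. With a real file you would pass its path instead, much as you would pass a path to read_csv in R.

```python
import io

import pandas as pd

# A tiny in-memory CSV standing in for a file on disk (invented data).
csv_text = "name,score\nana,90\nben,85\n"

# pd.read_csv accepts a file path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)
print(df["score"].mean())
```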