25 Python Libraries and Functions for Data Science
There are a lot of libraries for Data Science! Although these libraries are hardly ever used all at once, it is important as a data scientist or an aspiring data scientist to know which libraries can be used for which purposes. You can then investigate further whenever you need one of them for a project or for practice. I’ll be mentioning 25 libraries that are popularly used for data science. While it is important to know that these libraries exist, I feel it is just as important to have an idea of what functions exist within them. Listing all 25 libraries along with their functions wouldn’t make much sense, because you would probably never read through it all. Mentioning the functions of a few libraries that you use regularly, or have seen a lot of people use, should be more helpful. It is also good practice to know where these functions live, so you don’t end up importing every module where you suspect a function might be while writing your code.
The 25 libraries are listed below:
1. TensorFlow: for deep learning
2. NumPy (Numerical Python): for array manipulations
3. SciPy (Scientific Python): for mathematics and scientific problems
4. Matplotlib: for data visualization
5. Scrapy: used for scraping data from websites
6. BeautifulSoup: for parsing HTML and XML files and extracting data from them
7. Scikit-learn (SciPy Toolkit): for machine learning
8. PyTorch: for deep learning, commonly computer vision and natural language processing
9. Keras: for deep learning; acts as an interface for TensorFlow
10. XGBoost (Extreme Gradient Boosting): for gradient boosting, implemented with a focus on training speed and model performance
11. Seaborn: for visualization built upon matplotlib
12. Bokeh: for interactive visualization on modern web browsers
13. Plotly: for interactive, web-based visualization
14. Pydot: provides an interface for Graphviz’s DOT language
15. Graphviz (Graph Visualization): for visualizing graphs of connected nodes and edges
16. Statsmodels: for statistical models and tests
17. spaCy: for natural language processing and extracting information from text
18. Gensim (Generate Similar): for unsupervised topic modelling and natural language processing
19. NLTK (Natural Language Toolkit): for natural language processing and machine learning
20. PyBrain (Python-Based Reinforcement Learning, Artificial Intelligence and Neural Network Library): for machine learning; provides algorithms and environments for testing them
21. PyQt: Python bindings for Qt, used for developing graphical applications
22. OpenCV (Open Source Computer Vision): for real-time computer vision and image processing
23. ggplot (Grammar of Graphics Plot): a visualization library based on the grammar of graphics
24. Shogun: for machine learning; contains a range of machine learning algorithms
25. Theano: a numerical computing library for deep learning and multi-dimensional array manipulation
It is almost impossible to have started your career in Data Science and not encountered the popular libraries Matplotlib, NumPy and Pandas. I will name just a few functions from these libraries. Of course, these libraries have a lot of functions, A LOT! In fact, you may never use some of them, not necessarily because you aren’t exploring these libraries enough, but because you may have some other way of achieving your goal. I also remember that when I started using these libraries, it was difficult to remember where each function was found, so most times I would just import all the popular libraries. You don’t have to do that: be confident of where these functions come from and what they are used for. A function is usually accessed directly from the library’s alias (e.g., plt.plot); other times, a submodule comes between the library and the function (e.g., np.random.randint, where random is a submodule of NumPy). In both cases, “plot” and “randint” are the functions. So let’s dive in and look at functions in these popular libraries.
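To make the naming pattern concrete, here is a minimal sketch of how the two access styles look in code. The variable name values and the numbers passed to randint are only illustrative assumptions.

```python
import inspect

import matplotlib.pyplot as plt  # plt is an alias for the pyplot module
import numpy as np               # np is an alias for the numpy package

# plot is reached directly through the alias: alias.function
plot_function = plt.plot

# randint sits one level deeper: alias.submodule.function
print(inspect.ismodule(np.random))  # np.random is a module, not a class

# Five random integers from 0 up to (but not including) 10
values = np.random.randint(0, 10, size=5)
print(values)
```

Knowing that randint lives under np.random means a single import of numpy is enough; there is no need to import several libraries hoping one of them has it.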
Matplotlib:
This is used for visualization. In practice you import its pyplot module, matplotlib.pyplot, and give it the alias “plt”. From then on, instead of writing matplotlib.pyplot in full while writing code, you can write plt.
Some functions popularly used under Matplotlib are as follows. I will give detailed samples of how these are used in subsequent articles.
1. plot: plots a line graph
2. subplot: this adds a subplot to a figure
3. xscale: this sets the x axis scale
4. yscale: this sets the y axis scale
5. title: this creates a title for the graph
6. show: this displays a figure
7. xlabel: this labels the x axis
8. ylabel: this labels the y axis
9. figure: this creates a new figure
10. xlim: gets or sets the limits of the x axis
11. ylim: gets or sets the limits of the y axis
12. bar: creates a bar graph
13. imsave: saves an array as an image file
14. figtext: adds text to a figure
15. legend: creates a legend for the graph
16. table: this adds a table to an axes
17. tick_params: this changes the properties of ticks, tick labels and grid lines
18. vlines: plots vertical lines
19. colorbar: this adds a colorbar to a plot
20. grid: this configures the grid lines
21. plot_date: plots data that contains dates
22. margins: sets the autoscale margins
23. arrow: adds an arrow to the axes
24. clf: clears the current figure (close is the function that closes a figure window)
25. ioff: turns interactive mode off
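As a quick illustration, the short sketch below strings several of the functions above together. The data, the labels and the choice of the non-interactive "Agg" backend are assumptions made so the script runs anywhere, including machines without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: no display available; remove for an interactive window
import matplotlib.pyplot as plt

# Made-up sample data
x = [1, 2, 3, 4]
y = [1, 4, 9, 16]

fig = plt.figure()                 # figure: creates a new figure
plt.plot(x, y, label="squares")    # plot: draws a line graph
plt.title("A simple line graph")   # title: sets the title
plt.xlabel("x")                    # xlabel / ylabel: label the axes
plt.ylabel("y")
plt.xlim(0, 5)                     # xlim / ylim: set the axis limits
plt.ylim(0, 20)
plt.grid(True)                     # grid: turns the grid lines on
plt.legend()                       # legend: uses the label passed to plot

ax = plt.gca()                     # the axes that the calls above drew on
```

With an interactive backend, calling plt.show() at the end would display the figure.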
Pandas:
The name Pandas is derived from “panel data”, and is also a play on the phrase “Python data analysis”. Pandas is used for data manipulation, wrangling, merging, analysis and reshaping. Pandas is usually imported with the alias “pd”. Its central object is a 2-dimensional table called a DataFrame.
Some functions, methods and attributes found in the Pandas library are shown below:
1. read_csv: this imports a csv file
2. read_html(url): this extracts the tables from a web page as a list of DataFrames
3. to_csv: writes a DataFrame to a csv file
4. to_datetime: converts a column to datetime format
5. Series: creates a Series, a one-dimensional labeled array
6. head: displays the first rows
7. tail: displays the last rows
8. shape: an attribute that shows the dimensions of the DataFrame
9. info: displays a concise summary of a DataFrame
10. value_counts: counts how many times each unique value occurs
11. iloc: this does a selection by integer position
12. isnull: detects missing values
13. notnull: detects values not missing
14. dropna: drops rows (or columns) that contain missing values
15. fillna: replaces null values
16. replace: replaces values with given values
17. sort_values: arranges values either in ascending or descending order
18. groupby: groups rows by the unique values in one or more columns
19. append: adds rows to the end of another DataFrame. It was removed in pandas 2.0, so use concat in current versions.
20. concat: combines DataFrames by attaching the rows (or columns) of one to another
21. mean: gives the mean value of a column
22. max: gives the highest value in a column
23. unique: returns the unique values of a series
24. min: gives the lowest value in a column of a DataFrame
25. count: gives the number of non-null values in each column of a DataFrame
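Here is a small sketch that exercises a handful of the entries above on a made-up DataFrame. The column names region and sales, and all the values, are invented for illustration only.

```python
import pandas as pd

# Made-up data with one missing value
df = pd.DataFrame({
    "region": ["north", "south", "north", "east"],
    "sales": [10.0, None, 30.0, 20.0],
})

print(df.head(2))    # head: first rows
print(df.shape)      # shape is an attribute, so no parentheses

n_missing = df["sales"].isnull().sum()         # isnull: flags missing values
filled = df.fillna({"sales": 0.0})             # fillna: replaces missing values
region_counts = df["region"].value_counts()    # occurrences of each unique value
mean_by_region = filled.groupby("region")["sales"].mean()

# concat takes the place of the removed append method
extra = pd.DataFrame({"region": ["west"], "sales": [5.0]})
combined = pd.concat([filled, extra], ignore_index=True)

first_row = combined.iloc[0]                   # iloc: selection by position
sorted_df = combined.sort_values("sales", ascending=False)
```

Note that fillna, concat and sort_values return new DataFrames here; the original df is left unchanged.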
NumPy:
Numerical Python (NumPy) is used for multi-dimensional arrays. This library contains mathematical functions that can be applied to arrays. NumPy is usually imported with the alias “np” in Python code.
There are many functions in NumPy, and some of the frequently used ones are mentioned below:
1. loadtxt: loads data from a text file
2. savetxt: writes an array to a csv or text file
3. array: creates an array of one, two or more dimensions
4. arange: creates an array with a given range and step size
5. randint: generates random integers (found under np.random)
6. tolist: converts an array to list
7. sort: sorts an array
8. reshape: reshapes an array to a given number of rows and columns
9. split: splits an array into a given number of sub-arrays
10. add: adds a value (or another array) to each element in an array
11. multiply: multiplies each element in an array by a value (or by another array, element by element)
12. divide: divides each element in an array by a value (or by another array, element by element)
13. ceil: rounds each element of an array up to the nearest integer
14. floor: rounds down to the nearest integer
15. sum: sums up an array
16. min: gives the minimum value of an array
17. max: produces the maximum value of an array
18. std: gives the standard deviation of an array, or of each row or column when an axis is given
19. seed: sets the starting state of the random number generator so it produces the same values each time the code is run (found under np.random)
20. dtype: an attribute that gives the type of the elements in an array
21. shape: an attribute that gives the dimensions of an array (for a 2D array, the number of rows and columns)
22. insert: inserts a value into a specified position
23. abs: produces the absolute values of elements in an array
24. ndim: an attribute that gives the number of dimensions of your array
25. subtract: subtracts a value (or another array) from each element in an array
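The sketch below tries out several of the NumPy entries above on a small array. The numbers and variable names are arbitrary choices for illustration.

```python
import numpy as np

# seed: the same seed gives the same "random" numbers on every run
np.random.seed(0)
rolls = np.random.randint(1, 7, size=6)        # integers from 1 to 6
np.random.seed(0)
rolls_again = np.random.randint(1, 7, size=6)  # identical to rolls

a = np.arange(0, 12, 2)          # arange: 0, 2, 4, 6, 8, 10
m = a.reshape(2, 3)              # reshape: 2 rows, 3 columns

plus_one = np.add(m, 1)          # add 1 to every element
doubled = np.multiply(m, 2)      # multiply every element by 2
quarters = np.divide(m, 4)       # divide every element by 4

rounded_up = np.ceil(quarters)   # round each element up
rounded_down = np.floor(quarters)

total = m.sum()                  # sum of all elements
col_std = m.std(axis=0)          # standard deviation of each column
parts = np.split(a, 3)           # three sub-arrays of equal size
as_list = m.tolist()             # nested Python list

print(m.shape, m.ndim, m.dtype)  # attributes, so no parentheses
```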
Each of these functions has its own signature and parameters, so check the documentation for the details of how to call it.
Hopefully, you now have an idea of where to find several functions in these popular libraries and what they are used for.
Go ahead and explore your most-used libraries even more!