25 Python Libraries and Functions for Data Science

Gold Ochim
Published in The Startup
7 min read · Nov 21, 2020


There are a lot of libraries for data science! Although these libraries are hardly ever used all at the same time, it is important for a data scientist, or an aspiring one, to know which libraries serve which purposes. You can then investigate further whenever you need one of them for a project or for practice. I will mention 25 libraries that are popularly used for data science. While it is important to know that these libraries exist, I feel it is just as important to have an idea of what functions exist within them. Listing the functions of all 25 libraries wouldn’t make much sense, because you would probably never read through them all. Mentioning a few from the libraries that you use regularly, or have seen a lot of people use, should be more helpful. It is also good practice to know where these functions live, so that you don’t end up importing every module in which you suspect a function might be while writing your code.

So these 25 libraries are listed below:

1. TensorFlow: for deep learning

2. NumPy (Numerical Python): for array manipulations

3. SciPy (Scientific Python): for mathematics and scientific problems

4. Matplotlib: for data visualization

5. Scrapy: used for scraping data from websites

6. BeautifulSoup: for parsing (extracting) data from HTML and XML files

7. Scikit-learn (SciPy Toolkit): for machine learning

8. PyTorch: for deep learning, commonly used for computer vision and natural language processing

9. Keras: for deep learning; acts as an interface for TensorFlow

10. XGBoost (Extreme Gradient Boosting): for gradient boosting. It improves speed and performance of models

11. Seaborn: for visualization built upon matplotlib

12. Bokeh: for interactive visualization on modern web browsers

13. Plotly: for interactive, web-based visualization

14. Pydot: provides an interface to Graphviz’s DOT language

15. Graphviz (Graph Visualization): for visualizing graphs made up of nodes and connecting edges

16. Statsmodels: for statistical models and tests

17. spaCy: for natural language processing and extracting information from text

18. Gensim (Generate Similar): for machine learning; unsupervised topic modelling and natural language processing

19. NLTK (Natural Language Toolkit): for natural language processing and machine learning

20. PyBrain (Python-Based Reinforcement Learning, Artificial Intelligence and Neural Network Library): for machine learning; provides algorithms and environments for testing them

21. PyQt: Python’s interface to Qt, used for developing graphical applications

22. OpenCV (Open Source Computer Vision): for real-time computer vision and image processing

23. ggplot (Grammar of Graphics Plot): a visualization library based on the grammar of graphics

24. Shogun: for machine learning; contains a wide range of machine learning algorithms

25. Theano: a numerical computing library for multi-dimensional array manipulation, used in deep learning

It is almost impossible to have started a career in data science without encountering the popular libraries Matplotlib, NumPy and Pandas. I will name just a few functions from these libraries. Of course, these libraries have a lot of functions, A LOT! In fact, you may never use some of them, not necessarily because you aren’t exploring the libraries enough, but because you may have another way of achieving your goal. I also remember that when I started using these libraries for data science, it was difficult to remember where each function was found, so most times I would just import all of the popular libraries. You don’t have to do that: be confident of where these functions come from and what they are used for. Functions are usually written as an attribute of the library or its alias (e.g., plt.plot); other times, a submodule is written before the function (e.g., np.random.randint, where “random” is a submodule of NumPy). In both cases, “plot” and “randint” are the functions. So let’s dive in and look at functions in these popular libraries.
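
As a small illustration of knowing where a function lives, here is a minimal sketch that reaches randint in two ways: through the np alias, and by importing only that one function from NumPy’s random submodule. The variable names and the seed value are made up for the example.

```python
import numpy as np                 # the whole library, under its usual alias
from numpy.random import randint   # or only the one function that is needed

# "randint" lives in NumPy's "random" submodule: np.random.randint
np.random.seed(0)                  # fix the seed so the run is repeatable
via_alias = np.random.randint(0, 10, size=5)   # five integers from 0 to 9

np.random.seed(0)                  # reset to the same seed
via_direct_import = randint(0, 10, size=5)     # same function, shorter name

# Both names refer to the same function, so the two arrays are equal.
print(via_alias)
print(via_direct_import)
print(np.array_equal(via_alias, via_direct_import))
```

Either style works; the point is that you only need to import the library (or the single function) that actually provides what you are calling.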

Matplotlib:

This is used for visualization. Matplotlib’s plotting functions live in its pyplot module, which is conventionally imported with the alias “plt” (import matplotlib.pyplot as plt). After that, plt can be written instead of the full module name throughout your code.

Some functions popularly used under Matplotlib are as follows. I will give detailed examples of how these are used in subsequent articles.

1. plot: plots a line graph

2. subplot: this adds a subplot to a figure

3. xscale: this sets the x axis scale

4. yscale: this sets the y axis scale

5. title: this creates a title for the graph

6. show: this displays a figure

7. xlabel: this labels the x axis

8. ylabel: this labels the y axis

9. figure: this creates a new figure

10. xlim: sets a limit to the x-axis

11. ylim: sets a limit to the y axis

12. bar: creates a bar graph

13. imsave: saves an array as an image file

14. figtext: adds a text to a figure

15. legend: creates a legend for the graph

16. table: this adds a table to an axis

17. tick_params: this changes the properties of ticks, tick labels and grid lines

18. vlines: plots vertical lines

19. colorbar: this adds a colorbar to a plot

20. grid: this shows, hides or configures the grid lines

21. plot_date: plots data that contains dates

22. margins: sets or retrieves the autoscaling margins

23. arrow: adds an arrow to the axis

24. clf: clears the current figure (close is the function that closes a figure window)

25. ioff: turns interactive mode off
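
To show how a few of the functions above fit together, here is a minimal sketch that draws one line graph. The data is made up, and the “Agg” backend is an assumption so that the code also runs on a machine without a display; in a notebook or desktop session you would normally skip that line and call plt.show() instead.

```python
import matplotlib
matplotlib.use("Agg")                    # assumption: draw off-screen, no window
import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]                      # made-up sample data
y = [1, 4, 9, 16, 25]

fig = plt.figure(figsize=(6, 4))         # figure: creates a new figure
plt.plot(x, y, label="y = x squared")    # plot: draws a line graph
plt.title("Squares")                     # title: sets the graph title
plt.xlabel("x")                          # xlabel / ylabel: label the axes
plt.ylabel("y")
plt.xlim(0, 6)                           # xlim / ylim: limit the visible range
plt.ylim(0, 30)
plt.grid(True)                           # grid: turns the grid lines on
plt.legend()                             # legend: uses the label given to plot

ax = plt.gca()                           # the axes the calls above drew on
print(ax.get_title(), ax.get_xlim(), ax.get_ylim())

plt.close(fig)                           # close: releases the figure when done
```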

Pandas:

The name Pandas is derived from the term “panel data”, and is also a play on the phrase “Python data analysis”. Pandas is used for data manipulation, wrangling, merging, analysis and reshaping. Pandas is usually imported and given the alias “pd”. Its central object is a two-dimensional table called a DataFrame.

Some functions found in the Pandas library are shown below:

1. read_csv: this imports a csv file

2. read_html(url): this extracts a table from a web page

3. to_csv: writes a DataFrame to a csv file

4. to_datetime: converts a column (or other values) to datetime format

5. Series: creates a Series, a one-dimensional labelled array

6. head: displays the first rows (five by default)

7. tail: displays the last rows (five by default)

8. shape: an attribute that shows the dimensions of the DataFrame

9. info: displays a concise summary of a DataFrame

10. value_counts: counts how many times each unique value occurs

11. iloc: this does a selection by position

12. isnull: detects missing values

13. notnull: detects values not missing

14. dropna: drops rows (or columns) that contain missing values

15. fillna: replaces null values

16. replace: replaces values with given values

17. sort_values: arranges values either in ascending or descending order

18. groupby: groups rows by the unique values of one or more columns

19. append: adds rows to the end of another DataFrame (deprecated in pandas 1.4 and removed in pandas 2.0; use concat instead)

20. concat: combines DataFrames by stacking their rows, or their columns, together

21. mean: gives the mean value of a column

22. max: gives the highest value in a column

23. unique: returns the unique values of a series

24. min: gives the lowest value in a column of a DataFrame

25. count: counts the non-null values in each column of a DataFrame
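
Here is a minimal sketch that strings several of the Pandas functions above together on a tiny, made-up table. The column names and values are invented for the example.

```python
import pandas as pd

# A small made-up DataFrame with one missing value in "sales"
df = pd.DataFrame({
    "city": ["Lagos", "Abuja", "Lagos", "Kano", "Abuja"],
    "sales": [120.0, 80.0, None, 50.0, 100.0],
})

print(df.head(3))                         # head: first three rows
print(df.shape)                           # shape: (rows, columns)

missing = df["sales"].isnull().sum()      # isnull: how many values are missing
counts = df["city"].value_counts()        # value_counts: rows per city

# fillna: replace the missing sale with the mean of the known sales
filled = df.fillna({"sales": df["sales"].mean()})

# groupby + mean: average sales per city
per_city = filled.groupby("city")["sales"].mean()

# concat: add rows from another DataFrame (the replacement for append)
extra = pd.DataFrame({"city": ["Ibadan"], "sales": [70.0]})
combined = pd.concat([df, extra], ignore_index=True)

# sort_values: highest sales first (missing values go last by default)
ordered = combined.sort_values("sales", ascending=False)
print(ordered)
```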

NumPy:

Numerical Python (NumPy) is used for multi-dimensional arrays. This library contains mathematical functions that can be applied to arrays. NumPy is popularly imported with the alias “np” in Python code.

There are many functions in NumPy, and some of the frequently used ones are mentioned below:

1. loadtxt: loads data from a text file

2. savetxt: writes an array to a csv or text file

3. array: creates an array of one or more dimensions

4. arange: creates an array with a given range and step size

5. randint: generates random integers (a single value or an array), found under np.random

6. tolist: converts an array to list

7. sort: sorts an array

8. reshape: reshapes an array to a given number of rows and columns

9. split: splits an array into a given number of sub-arrays

10. add: adds two arrays element by element, or adds a particular value to each element in an array

11. multiply: multiplies two arrays element by element, or multiplies each element in an array by a particular value

12. divide: divides two arrays element by element, or divides each element in an array by a particular value

13. ceil: rounds each element of an array up to the nearest integer

14. floor: rounds each element of an array down to the nearest integer

15. sum: sums up an array

16. min: gives the minimum value of an array

17. max: produces the maximum value of an array

18. std: gives the standard deviation of an array, optionally along a given axis

19. seed: this sets the seed of the random number generator so that it produces the same values each time the code is run

20. dtype: an attribute that returns the type of the elements in an array

21. shape: an attribute that returns the dimensions of an array (for a two-dimensional array, the number of rows and the number of columns)

22. insert: inserts a value into a specified position

23. abs: produces the absolute values of elements in an array

24. ndim: an attribute that gives the number of dimensions of your array

25. subtract: subtracts two arrays element by element, or subtracts a particular value from each element in an array
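
The short sketch below tries out a handful of the NumPy functions listed above on a small array. The numbers are made up, and the expected values in the comments follow directly from them.

```python
import numpy as np

a = np.arange(0, 12, 2)        # arange: [0, 2, 4, 6, 8, 10]
grid = a.reshape(2, 3)         # reshape: 2 rows, 3 columns

print(grid.shape)              # shape: (2, 3)
print(grid.ndim)               # ndim: 2
print(grid.dtype)              # dtype: an integer type (platform dependent)

plus_one = np.add(grid, 1)     # add: 1 is added to every element
halves = np.divide(grid, 4)    # divide: [[0.0, 0.5, 1.0], [1.5, 2.0, 2.5]]

rounded_up = np.ceil(halves)     # ceil:  [[0, 1, 1], [2, 2, 3]]
rounded_down = np.floor(halves)  # floor: [[0, 0, 1], [1, 2, 2]]

total = grid.sum()             # sum: 0 + 2 + 4 + 6 + 8 + 10 = 30
column_std = grid.std(axis=0)  # std: standard deviation of each column

parts = np.split(a, 3)         # split: three arrays of two elements each
as_list = a.tolist()           # tolist: back to a plain Python list
print(as_list)
```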

Each of these functions has its own signature and parameters, which are described in the library’s documentation.

But hopefully, you now have an idea of where you can find several functions in these popular libraries and what they are used for.

Go ahead and explore your popularly used libraries even more!
