Working with large datasets in pandas? Here are some tips to boost performance.

While working on a machine learning or data visualization project, I believe Python + pandas + Jupyter is the best set of tools for quickly validating a POC or hypothesis. However, performance sometimes degrades with large datasets, so I would like to share some of the tips I use to speed things up.

Data Set

I have chosen an existing Kaggle dataset, NYC Parking Tickets 2014, to demonstrate what I am trying to convey.

  • CSV file size: 1.74 GB
  • No of Rows: 9M+
  • No of Columns: 51

1. Selective Data: Save RAM = more performance

With data of less than 500 MB it usually doesn't matter, and by default we read a CSV file as a whole. But there is an option called nrows where we can specify the number of rows to import. For example, if we want to read only the first 10000 rows of a CSV file, the syntax is as follows. Another option, skiprows, allows you to skip rows. These options let you test your code on a smaller chunk of data multiple times rather than waiting for all the data to load.

import pandas as pd

data_tickets = pd.read_csv(
    '../input/Parking_Violations_Issued_-_Fiscal_Year_2014__August_2013___June_2014_.csv',
    nrows=10000,              # read only 10000 data rows
    skiprows=range(1, 1001),  # skip rows 1-1000 but keep the header line
)

These two options are helpful if you have just started working on the data and want to try a few things.
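A runnable miniature of the same idea, using a small in-memory CSV instead of the Kaggle file (the data here is made up). One detail worth knowing: passing a plain integer to skiprows also skips the header line, so a range starting at 1 keeps the column names:

```python
import io

import pandas as pd

# Tiny in-memory CSV standing in for the parking-tickets file (made-up data).
csv = io.StringIO(
    "Summons Number,Plate ID\n"
    "1,AAA111\n"
    "2,BBB222\n"
    "3,CCC333\n"
    "4,DDD444\n"
    "5,EEE555\n"
)

# range(1, 3) skips data rows 1-2 but keeps line 0 (the header);
# nrows=2 then reads just the next two rows.
df = pd.read_csv(csv, nrows=2, skiprows=range(1, 3))
print(df['Summons Number'].tolist())  # [3, 4]
```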

2. Downcast: Use only as much as you need

Downcast, in simple terms, means converting a datatype with high storage cost to a lower-cost datatype that still holds all the values, e.g. int64 to int16. Let's say we have a column currently stored as int64 but whose values fit in int16 or even int8; then we should always downcast. Let's see how we do it.

Summons Number                         int64
Plate ID                              object
Registration State                    object
Plate Type                            object
Issue Date                            object
Unregistered Vehicle?                  int64
Vehicle Year                           int64
Meter Number                          object
Feet From Curb                         int64
Violation Post Code                  float64
Violation Description                float64
No Standing or Stopping Violation    float64
Hydrant Violation                    float64
Double Parking Violation             float64
Latitude                             float64
Longitude                            float64
Community Board                      float64
Community Council                    float64
Census Tract                         float64
BIN                                  float64
BBL                                  float64
NTA                                  float64
dtype: object

In the above example, we have Vehicle Year as int64. As we know, a year cannot be negative, and since the data is from 2014 no value will be greater than 2014, so replacing it with uint16 (which holds 0-65535) is safe.

The CSV file size is 1.74 GB, but the memory used by the pandas object will be 3.5 GB+. Using data_tickets.dtypes we found 24 columns using 64-bit types which can easily be converted to 16-bit (uint16 or float16).

Let's do some math.

int64 = 64 bits (8 bytes)
int16 = 16 bits (2 bytes)
9M records of int64 = 72 MB
9M records of int16 = 18 MB
1 column saves = 54 MB
24 columns = ~1.3 GB
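The arithmetic above can be double-checked with NumPy's dtype itemsize; this quick sketch plugs in the 9M rows and 24 columns from this dataset:

```python
import numpy as np

rows = 9_000_000
# bytes saved per element when going from int64 (8 bytes) to int16 (2 bytes)
per_elem_saved = np.dtype('int64').itemsize - np.dtype('int16').itemsize

saved_per_column_mb = rows * per_elem_saved / 1e6
saved_total_gb = 24 * saved_per_column_mb / 1000

print(saved_per_column_mb)       # 54.0
print(round(saved_total_gb, 2))  # 1.3
```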

There are two ways to do this.

#1. Convert all int columns at once
data_tickets_ints = data_tickets.select_dtypes(include=['int'])
data_tickets_uints = data_tickets_ints.apply(pd.to_numeric, downcast='unsigned')

#2. Downcast one column at a time
data_tickets['Vehicle Year'] = data_tickets['Vehicle Year'].astype('uint16')
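Before hard-coding a dtype as in #2, it is worth confirming the values actually fit. A small sketch using np.iinfo; the sample values below are made up stand-ins for the real column:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the 'Vehicle Year' column.
vehicle_year = pd.Series([0, 1998, 2005, 2014], dtype='int64')

bounds = np.iinfo('uint16')  # uint16 holds 0..65535
assert bounds.min <= vehicle_year.min() <= vehicle_year.max() <= bounds.max

vehicle_year = vehicle_year.astype('uint16')
print(vehicle_year.dtype)  # uint16
```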

3. Object Optimization using Categorical.

Categorical is a pandas data type corresponding to categorical variables in statistics. A categorical variable takes on a limited, and usually fixed, number of possible values — examples here include Vehicle Body Type, Vehicle Make, Issuing Agency, Registration State, Plate Type, Street Name, and more. In simple terms, pandas stores each distinct value once and replaces every entry with a small integer code, so variable-length strings become fixed-width integers.

for dtype in ['float','int','object']:
    selected_dtype = data_tickets.select_dtypes(include=[dtype])
    mean_usage_b = selected_dtype.memory_usage(deep=True).mean()
    mean_usage_mb = mean_usage_b / 1024 ** 2
    print("Average memory usage for {} columns: {:03.2f} MB".format(dtype,mean_usage_mb))
# Average memory usage for float columns: 65.57 MB
# Average memory usage for int columns: 62.49 MB
# Average memory usage for object columns: 476.72 MB

Now let's see how to implement Categorical.

vehicle_make = data_tickets['Vehicle Make']
# 0     AUDI
# 1     FORD
# 2    CHEVR
# 3     FORD
# 4      GMC
# Name: Vehicle Make, dtype: object [533.27MB]
vehicle_make_cat = vehicle_make.astype('category')
# 0     AUDI
# 1     FORD
# 2    CHEVR
# 3     FORD
# 4      GMC
# Name: Vehicle Make, dtype: category [18MB]
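You can measure the saving yourself with memory_usage(deep=True). Here is a sketch on a made-up column that repeats the same five makes, mimicking the low-cardinality 'Vehicle Make' column:

```python
import pandas as pd

# Hypothetical repetitive column similar to 'Vehicle Make'.
make = pd.Series(['AUDI', 'FORD', 'CHEVR', 'FORD', 'GMC'] * 200_000)
make_cat = make.astype('category')

obj_mb = make.memory_usage(deep=True) / 1024 ** 2
cat_mb = make_cat.memory_usage(deep=True) / 1024 ** 2
# With only 5 distinct values, the category codes fit in a single byte each,
# so the categorical version is a tiny fraction of the object version.
print(f"object: {obj_mb:.1f} MB  category: {cat_mb:.1f} MB")
```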

4. Using del and gc.collect()

Datasets loaded into RAM are not freed on their own, and any copies we make while transforming the original dataset keep consuming memory as well. Therefore we should build the habit of deleting data frames once they are no longer used.

Import gc at the beginning of your project; then, each time you want to free up space, call gc.collect() after deleting the object.

del data_tickets
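Putting the two together, a minimal sketch (the scratch frame here is a hypothetical intermediate result):

```python
import gc

import pandas as pd

# Hypothetical large intermediate frame.
scratch = pd.DataFrame({'value': range(1_000_000)})
# ... work with scratch ...

del scratch           # drop the name so the frame becomes unreachable
freed = gc.collect()  # run a collection pass immediately

print('scratch' in globals())  # False
```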