A Brief Overview of the groupby() and agg() Methods in Pandas

Python has many libraries that make it easy to manipulate data in diverse ways, and Pandas is one of them. Pandas provides various functions for analysing, visualising and manipulating data.

The groupby() and agg() methods let users split data into groups and compute aggregations over each group.

The groupby() method in pandas splits a data set into groups based on one or more columns. Once the data is split into groups, users can apply various operations to each group.

The method returns a GroupBy object, which can be used to perform various operations on the grouped data. The groupby() method is typically used in combination with other methods.
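As a small sketch of what a GroupBy object holds, the snippet below groups a toy DataFrame and iterates over the groups. The `toy` DataFrame and its `shop` and `sales` columns are invented for illustration and are not part of the dataset used later in this article.

```python
import pandas as pd

# Hypothetical toy data, invented for this illustration
toy = pd.DataFrame({
    "shop": ["north", "south", "north"],
    "sales": [10, 20, 30],
})

grouped = toy.groupby("shop")

# The result is a GroupBy object, not a DataFrame
print(type(grouped).__name__)
print(grouped.ngroups)  # number of distinct groups

# Iterating yields (group key, sub-DataFrame) pairs
for key, frame in grouped:
    print(key, frame["sales"].tolist())
```

The GroupBy object on its own only describes how the rows are split. A result is produced once a method such as sum() or apply() is called on it.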

The syntax for the groupby() method is as follows:

grouped_data = dataframe.groupby(column_name)

Here 'column_name' is the column by which we want to group the data. After grouping the data, we can apply operations such as aggregation functions to each group.
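The syntax above can be sketched on a small example. The `orders` DataFrame and its column names are hypothetical and chosen only to show grouping by one column and by a list of columns.

```python
import pandas as pd

# Hypothetical data, invented for this illustration
orders = pd.DataFrame({
    "dish": ["stew", "stew", "roast", "roast"],
    "size": ["small", "large", "small", "small"],
    "price": [5, 8, 9, 9],
})

# Group by one column, then sum the prices within each dish
per_dish = orders.groupby("dish")["price"].sum()

# Group by several columns by passing a list of column names
per_dish_size = orders.groupby(["dish", "size"])["price"].sum()

print(per_dish)
print(per_dish_size)
```

Grouping by a list of columns produces one group for each combination of values that actually occurs in the data.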

Aggregation in Pandas refers to the process of grouping data and applying a function to each group to produce a summary of the data.

Aggregation functions combine the values within a group and return a single value as output. Common aggregation functions include sum(), mean(), count(), min() and max().

The aggregate() method in Pandas, also available under the shorter alias agg(), is used to apply one or more aggregation functions to the grouped data.

The aggregation functions can be built-in functions, such as sum(), mean() and count(), or custom functions that you define.

The aggregate() method returns a DataFrame that contains the aggregated data, or a Series when a single function is applied to a single column.
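A minimal sketch of agg() with both built-in and custom functions might look like the following. The `toy` DataFrame and the `price_range` helper are hypothetical names introduced here for illustration.

```python
import pandas as pd

# Hypothetical data, invented for this illustration
toy = pd.DataFrame({
    "dish": ["stew", "stew", "roast"],
    "price": [5, 8, 9],
})

def price_range(values):
    """Custom aggregation: spread between the highest and lowest value."""
    return values.max() - values.min()

# Built-in functions are passed by name, custom functions as callables.
# A list of functions gives one result column per function.
summary = toy.groupby("dish")["price"].agg(["sum", "mean", price_range])

print(summary)
```

Each custom function receives the values of one group as a Series and should return a single value.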

Let us illustrate how these methods work with examples. We will apply custom functions to our dataset.

Suppose we have a data set about meat. Yes, I know it is a weird choice but hey, it makes sense. It contains the columns Meat_type, Quantity and Price.


import pandas as pd

data = {
    'Meat_type': ['Lamb casserole', 'Pork_root_casserole', 'Pork_root_casserole', 'Lamb casserole',
                  'Irish stew', 'Irish stew', 'Pork_root_casserole', 'Pork_root_casserole', 'Vienna_stake',
                  'Gammon_and_mushroom', 'Beef_roast', 'Beef_roast',
                  'Beef_roast', 'Beef_roast', 'Vienna_stake', 'Vienna_stake', 'Chicken breast',
                  'Chicken breast', 'Chicken breast', 'Chicken breast'],
    'Quantity': [70, 78, 78, 70, 89, 89, 78, 78, 69, 69, 75, 75, 75, 75, 69, 69, 68, 68, 68, 68],
    'Price': [150, 162, 162, 150, 180, 180, 162, 162, 172, 172, 180, 180, 180, 180, 172, 172, 155, 155, 155, 155],
}


In the code above, I import the pandas library and define a dictionary called "data".

The dictionary contains three keys: "Meat_type", "Quantity", and "Price".

The "Meat_type" key contains a list of strings that represent different types of meat dishes.

The "Quantity" key contains a list of integers that represent the quantity of each meat dish.

The "Price" key contains a list of integers that represent the price of each meat dish.

Next, I create a dataframe and display the first five rows of the dataframe with the code below:


df = pd.DataFrame(data)
df.head(5)

             Meat_type  Quantity  Price
0       Lamb casserole        70    150
1  Pork_root_casserole        78    162
2  Pork_root_casserole        78    162
3       Lamb casserole        70    150
4           Irish stew        89    180


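# Define a custom function that returns the mean of the 'Quantity' column
def func_mean(df):
    return df['Quantity'].mean()

# Group the rows by meat type and apply the function to each group
group_meat = df.groupby('Meat_type')
group_meat_mean = group_meat.apply(func_mean)
group_meat_mean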


# Define a custom function that returns the sum of the 'Price' column
def fun_sum(df):
    return df['Price'].sum()

group_sum = df.groupby('Meat_type')
final_group_sum = group_sum.apply(fun_sum)
final_group_sum


# Define a custom function that returns the count of the 'Quantity' column
def fun_count(df):
    return df['Quantity'].count()

group_count = df.groupby('Meat_type')
final_group_count = group_count.apply(fun_count)
final_group_count


Meat_type
Beef_roast             4
Chicken breast         4
Gammon_and_mushroom    1
Irish stew             2
Lamb casserole         2
Pork_root_casserole    4
Vienna_stake           3
dtype: int64


This is what is happening. I define three custom functions called func_mean, fun_sum and fun_count. Each takes a pandas DataFrame as input and returns, respectively, the mean of the 'Quantity' column, the sum of the 'Price' column and the count of the 'Quantity' column.

Each function is then applied to a GroupBy object (group_meat, group_sum and group_count) using the apply() method.

The results, group_meat_mean, final_group_sum and final_group_count, are Series indexed by 'Meat_type'. They contain the mean of the 'Quantity' column, the sum of the 'Price' column and the count of the 'Quantity' column for each group in the original DataFrame.

The groupby() method is used to group the DataFrame by the 'Meat_type' column.

This creates a DataFrameGroupBy object that can be used to apply functions to each group.

The apply() method then calls the custom function once per group, passing in that group's rows, and collects the returned values.

Because each function returns a single value, every result has exactly one entry per meat type.
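The same three summaries can also be computed in a single agg() call instead of three separate apply() calls. The sketch below uses pandas named aggregation on the article's dataset. The output column names `mean_quantity`, `total_price` and `row_count` are hypothetical names chosen for this illustration.

```python
import pandas as pd

# Same dataset as defined earlier in the article
data = {
    'Meat_type': ['Lamb casserole', 'Pork_root_casserole', 'Pork_root_casserole', 'Lamb casserole',
                  'Irish stew', 'Irish stew', 'Pork_root_casserole', 'Pork_root_casserole', 'Vienna_stake',
                  'Gammon_and_mushroom', 'Beef_roast', 'Beef_roast',
                  'Beef_roast', 'Beef_roast', 'Vienna_stake', 'Vienna_stake', 'Chicken breast',
                  'Chicken breast', 'Chicken breast', 'Chicken breast'],
    'Quantity': [70, 78, 78, 70, 89, 89, 78, 78, 69, 69, 75, 75, 75, 75, 69, 69, 68, 68, 68, 68],
    'Price': [150, 162, 162, 150, 180, 180, 162, 162, 172, 172, 180, 180, 180, 180, 172, 172, 155, 155, 155, 155],
}
df = pd.DataFrame(data)

# Named aggregation: result_column=(source_column, function_name)
summary = df.groupby('Meat_type').agg(
    mean_quantity=('Quantity', 'mean'),
    total_price=('Price', 'sum'),
    row_count=('Quantity', 'count'),
)

print(summary)
```

This returns one DataFrame with a row per meat type and a column per aggregation, which is often easier to read than three separate results.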

In conclusion, these examples demonstrate how to use the groupby() and apply() methods in pandas to group a DataFrame by a column and apply custom aggregation functions to each group. This technique can be used to perform various types of data analysis, such as calculating summary statistics or aggregating data.