Tutorial Pandas (Python)#

[ ]:
import pandas as pd
import numpy as np
import string

Creating Dataframe#

Dataframe can be created through a dictionary data type that consist of several variables as columns. The format can be written as follow

df = pd.DataFrame({
  'column_1' : list of data,
  'column_2' : list of data
})

We will create a dataframe with the random data that generated by Numpy library.

[ ]:
# configuring the random seed number
np.random.seed(42)

# Get all lowercase and uppercase letters
alphabets = list(string.ascii_letters.upper())

# creating the data
df_1 = pd.DataFrame({
    'code' : np.random.choice(alphabets, size=100),
    'value_1' : np.random.randint(1, 100, 100) * 0.4,
    'value_2' : np.random.randint(1, 100, 100)
    })
[ ]:
# Show the first 5 data
df_1.head()
code value_1 value_2
0 M 18.0 24
1 Z 26.0 79
2 C 35.6 59
3 O 28.4 32
4 Q 3.6 96
[ ]:
# show the last 5 data
df_1.tail()
code value_1 value_2
95 Y 24.8 3
96 S 14.8 20
97 O 38.8 24
98 C 20.4 54
99 O 17.6 33

The shape method of dataframe will return the dimension of dataframe. The first value indicates the number of rows and the second column indicates the number of columns.

We can also create a dataframe from list data type. It must be configured as multiple lists that describes a column and row as follows

[ ]:
df_2 = pd.DataFrame(
    [['A', 98.6, 3],
     ['D', 76.4, 7],
     ['K', 67.4, 8],
     ['R', 70.0, 4]],
    columns=['code', 'value_1', 'value_2'])

df_2.head()
code value_1 value_2
0 A 98.6 3
1 D 76.4 7
2 K 67.4 8
3 R 70.0 4

Each value in a list represent as a column data and each list represent as a row.

Summarize Dataframe#

Once we have the dataframe (we could have huge dataframe in the real case), we want to get some general statistical information related to the dataframe. In Pandas, there are some methods that can be used to extract the statistical information as describe below

Describe Method#

Describe method is used to get a general statistic infotmation such as count (number of data), mean, std, min, and max for each feature (column) that has numerical data type. We can use .describe() syntax of a dataframe to show the information as follow

[ ]:
df_1.describe()
value_1 value_2
count 100.000000 100.000000
mean 19.552000 48.800000
std 12.017668 31.101658
min 0.400000 1.000000
25% 10.800000 20.000000
50% 18.000000 52.000000
75% 29.400000 73.500000
max 39.600000 99.000000

Shape Method#

Shape methos is the simple way to get dataframe dimension only such as number of row and column by using .shape syntax as follow

[ ]:
df_1.shape
(100, 3)

Based on the dataframe of df_1, we get the information that this dataframe has 100 rows and 3 columns.

Info Method#

Info method will give us the information related to data type (int, float, object) and number of data for each column. We can directly know if there are some features that have empty value or data. We can apply this method by using .info() as follow

[ ]:
df_1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   code     100 non-null    object
 1   value_1  100 non-null    float64
 2   value_2  100 non-null    int64
dtypes: float64(1), int64(1), object(1)
memory usage: 2.5+ KB

Fundamental Statistical Method#

Pandas provide some fundamental statistical method for numerical feature such as sum, mean, min, max, std, var, quantile, median, and count. We can apply this method as follow

Sum Method

[ ]:
df_1_sum = df_1[['value_1', 'value_2']].sum()
df_1_sum
0
value_1 1955.2
value_2 4880.0

Mean Method

[ ]:
df_1_mean = df_1[['value_1', 'value_2']].mean()
df_1_mean
0
value_1 19.552
value_2 48.800

Max Method

[ ]:
df_1_max = df_1[['value_1', 'value_2']].max()
df_1_max
0
value_1 39.6
value_2 99.0

Standar Deviation Method

[ ]:
df_1_std = df_1[['value_1', 'value_2']].std()
df_1_std
0
value_1 12.017668
value_2 31.101658

Variance Method

[ ]:
df_1_var = df_1[['value_1', 'value_2']].var()
df_1_var
0
value_1 144.424339
value_2 967.313131

Quantile Method

We calculate the quantle for 25%, 50%, and 75%.

[ ]:
df_1_quantile = df_1[['value_1', 'value_2']].quantile([0.25, 0.5, 0.75])
df_1_quantile
value_1 value_2
0.25 10.8 20.0
0.50 18.0 52.0
0.75 29.4 73.5

Median Method

[ ]:
de_1_median = df_1[['value_1', 'value_2']].median()
de_1_median
0
value_1 18.0
value_2 52.0

Value Counts Method#

In particular case, we need to identify a number of categorical data of a feature name. For instance, we have code feature in df_1 and we wonder, how many code of each value in dataframe? We can get this information by using .value_counts() and assign the reult into df_code_count as follow

[ ]:
df_code_count = df_1['code'].value_counts()
df_code_count
count
code
U 8
R 8
N 7
B 6
X 6
Y 5
O 5
C 5
Z 5
H 4
P 4
D 4
J 4
G 3
M 3
K 3
S 3
W 3
I 3
V 3
Q 2
L 2
A 1
T 1
F 1
E 1

Unique Method and Nunique Method#

The unique method is used to identify the unique value in a feature name and nunique method is used to counted a total number of unique value if a feature name

[ ]:
df_unique = df_1['code'].unique()
df_unique
array(['M', 'Z', 'C', 'O', 'Q', 'H', 'U', 'S', 'W', 'K', 'X', 'J', 'N',
       'V', 'B', 'R', 'D', 'L', 'G', 'Y', 'A', 'P', 'I', 'T', 'F', 'E'],
      dtype=object)
[ ]:
df_nunique = df_1['code'].nunique()
df_nunique
26

Making a New Column#

In the middle of data anlysis, we will often to add a new column in a existing dataframe. For instance, the new column is generated from a certain calculation. Now, we have dataframe of df_1 that consists of features of code, value_1 and value_2. We want to add new feature (column) called value_3. This column is the multiplication between value_1 and value_2. It can be done by using Assign method and Manual method.

Assign Method#

We use .assign() method to add a new column to the current dataframe as follow

[ ]:
df_1 = df_1.assign(value_3 = df_1['value_1'] * df_1['value_2'])
df_1.head()
code value_1 value_2 value_3
0 M 18.0 24 432.0
1 Z 26.0 79 2054.0
2 C 35.6 59 2100.4
3 O 28.4 32 908.8
4 Q 3.6 96 345.6

Manual Method#

In manual method, we can add a new column by defining the new name of column in dataframe variable and assign new value into it as follow

df_1['new_col_name'] = value

we create a new column name as value_4 that apply division between value_1 and value_2.

[ ]:
# df_1['value_4'] = df_1['value_1'] / df_1['value_2'] # first type
df_1['value_4'] = df_1.value_1 / df_1.value_2 # second type
df_1.head()
code value_1 value_2 value_3 value_4
0 M 18.0 24 432.0 0.750000
1 Z 26.0 79 2054.0 0.329114
2 C 35.6 59 2100.4 0.603390
3 O 28.4 32 908.8 0.887500
4 Q 3.6 96 345.6 0.037500

Reshaping and Organizing Dataframe#

Concat#

We already have two datasets of dataframe, df_1 and df_2. We can combine the second dataframe into the first dataframe by using this method. It will be extended from the last row of the first dataframe. We will save the new dataframe into df_combine

[ ]:
df_combine = pd.concat([df_1, df_2])
df_combine.tail() # checking the updated dataframe at the last 5 data
code value_1 value_2 value_3 value_4
99 O 17.6 33 580.8 0.533333
0 A 98.6 3 NaN NaN
1 D 76.4 7 NaN NaN
2 K 67.4 8 NaN NaN
3 R 70.0 4 NaN NaN

Reset Index#

As we can see, the dataframe of df_2 has been added into the dataframe of df_1. However, the index number in the new dataframe of df_combine still follows the origin dataframe format, where the df_1 has the index start from 0 - 99 and the df_2 has the index start from 0 - 3. We can readjust the index number using reset_index method. We assign the attribute of inplace=True in the reset_index method to overide the data change of dataframe with the same name variable. If we do not write the attribute of inplace=True, the change will not be save unless you assign into new variable name.

[ ]:
df_combine.reset_index(inplace=True)
df_combine.tail()
index code value_1 value_2 value_3 value_4
99 99 O 17.6 33 580.8 0.533333
100 0 A 98.6 3 NaN NaN
101 1 D 76.4 7 NaN NaN
102 2 K 67.4 8 NaN NaN
103 3 R 70.0 4 NaN NaN

Drop#

We have just applied the reset_index method for the dataframe as shown in dataframe above. We can see that there is the old index column that has been added into the dataframe. In this case, the old index column is not required, We then can drop the old index column by using drop method with some attributes axis=1 and inplace=True. The axis=1 indicates the data is in the column

[ ]:
df_combine.drop(columns='index', axis=1, inplace=True)
df_combine.tail()
code value_1 value_2 value_3 value_4
99 O 17.6 33 580.8 0.533333
100 A 98.6 3 NaN NaN
101 D 76.4 7 NaN NaN
102 K 67.4 8 NaN NaN
103 R 70.0 4 NaN NaN

If we want to delete some columns, we can add multiple column names into the method as follow

df_combine.drop(columns=['col_name_1','col_name_2'], axis=1, inplace=True)

Rename#

We can rename the column name by using rename method. For instance, we want to rename value_1 to score_1 and value_2 to score_2

[ ]:
df_combine.rename(columns={'value_1' : 'score_1', 'value_2' : 'score_2'}, inplace=True)
df_combine.head()
code score_1 score_2 value_3 value_4
0 M 18.0 24 432.0 0.750000
1 Z 26.0 79 2054.0 0.329114
2 C 35.6 59 2100.4 0.603390
3 O 28.4 32 908.8 0.887500
4 Q 3.6 96 345.6 0.037500

Sorting Data#

Sometimes, we need to sort the data either ascendingly (lower to higher) or descendingly (higher to lower). In dataframe format, we can done this by using sort_values method with attribute name ascending. The attribute name of acending will take True for ascending and False for descending. By default, the attibute will be True. For instance, we can sort the data based on score_1 ascendingly and save into new datafrmae of df_combine_ascending.

[ ]:
df_combine_ascending = df_combine.sort_values(by='score_1', ascending=True).reset_index(drop=True)
df_combine_ascending.head()
code score_1 score_2 value_3 value_4
0 U 0.4 52 20.8 0.007692
1 C 0.4 69 27.6 0.005797
2 W 0.4 12 4.8 0.033333
3 Z 1.2 24 28.8 0.050000
4 S 1.2 9 10.8 0.133333

The score_1 data has been sorted ascendingly. You may notice that in the code above, we add reset_index method with attribute of drop=True. This indicates to reset the index number to the right order and drop the old index column directly through this command.

Now, we try to sort the score_2 descendingly and save it into new variable name of df_combine_descending.

[ ]:
df_combine_descending = df_combine.sort_values(by='score_2', ascending=False).reset_index(drop=True)
df_combine_descending.head()
code score_1 score_2 value_3 value_4
0 J 33.6 99 3326.4 0.339394
1 R 28.0 99 2772.0 0.282828
2 K 2.0 98 196.0 0.020408
3 B 13.6 97 1319.2 0.140206
4 M 31.6 97 3065.2 0.325773

Groupby#

Sometime, we want to collective the data based on the same category from a certain column and calculate its aggregate (e.g., mean, sum, etc). We can done this by using groupby method of dataframe. For instance, we want to grouping the data based on the code in the df_combine as follow

[ ]:
df_groupby = df_combine.groupby(by='code').mean()
df_groupby.head()
score_1 score_2 value_3 value_4
code
A 58.100000 36.500000 1232.000000 0.251429
B 18.333333 42.666667 894.133333 1.518709
C 16.240000 58.200000 897.520000 0.295373
D 35.440000 48.600000 1712.000000 3.104964
E 30.000000 90.000000 2700.000000 0.333333

Based on the code above, we have applied groupby method with aggregate of mean for all columns. This result gives the result where the code column as the index.

In case you want to calculate different aggregate method for specific column, we can apply .agg() method along with groupby method. For example, the score_1 will be aggregated by its mean and score_2 will be aggregated by its sum for each grouped code and save it into df_groupby_multi_agg.

[ ]:
df_groupby_multi_agg = df_combine.groupby(by='code').agg({'score_1':'mean', 'score_2':'sum'})
df_groupby_multi_agg.head()
score_1 score_2
code
A 58.100000 73
B 18.333333 256
C 16.240000 291
D 35.440000 243
E 30.000000 90

After grouping, we can check the dimension of new dataframe of df_groupby_multi_agg that has been reshaped, which is consist of 26 rows and 2 columns, where the code is the index.

[ ]:
df_groupby_multi_agg.shape
(26, 2)

Selecting Feature(s)#

We can create a new dataframe from selecting certain feature name(s). For instance, we have df_groupby_multi_agg that has feature’s name of score_1 and score_2. We can select for score_1 only as follow

[ ]:
df_score1 = df_groupby_multi_agg['score_1']
df_score1.head()
score_1
code
A 58.100000
B 18.333333
C 16.240000
D 35.440000
E 30.000000

There is another method to get a single feature name by calling the it as a method

dataframe.feature_name

We can see the example as follow

[ ]:
df_score1_method2 = df_groupby_multi_agg.score_1
df_score1_method2.head()
score_1
code
A 58.100000
B 18.333333
C 16.240000
D 35.440000
E 30.000000

Sometimes, we want to get more than one feature’s name. We can write the freature’s names that we want to extract within a list as follow

df_new = dataframe.[['feature1', 'feature2', 'feature3']]

and assign to new variable of df_new for instance.

We can also access the row data of dataframe by using index (default is number). To do this, we can use .iloc for the index nuumber or .loc for the non index number. For instance, we want to get the value of score_1 that belongs to C.

[ ]:
df_score1.loc['C']
np.float64(16.240000000000002)

If we use .iloc['C'] for the above case, it will rise an error because the index is not a number as follow

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/tmp/ipython-input-1732007559.py in <cell line: 0>()
----> 1 df_score1.iloc['C']

1 frames
/usr/local/lib/python3.12/dist-packages/pandas/core/indexing.py in _getitem_axis(self, key, axis)
   1747             key = item_from_zerodim(key)
   1748             if not is_integer(key):
-> 1749                 raise TypeError("Cannot index by location index with a non-integer key")
   1750
   1751             # validate the location

TypeError: Cannot index by location index with a non-integer key

We can use the integer number that corresponding to the C which is equal 2

[ ]:
df_score1.iloc[2]
np.float64(16.240000000000002)

We also can access for certain dataframe based on the range number of row by using .iloc[start:end]. We apply this method for dataframe of df_score1 to get a new dataframe from row 10 to 15.

[ ]:
df_score1_get_row = df_score1.iloc[10:15]
df_score1_get_row
score_1
code
K 26.550000
L 7.000000
M 17.600000
N 22.685714
O 29.280000

If we want to extract the new dataframe just from certain row index number and get all the data from the rest of rows, we can apply the index by using .iloc[10:]. The [10:] means that the data will be extracted from row of 10 to the end of original dataframe row.

[ ]:
df_score1_get_row = df_score1.iloc[10:]
df_score1_get_row
score_1
code
K 26.550000
L 7.000000
M 17.600000
N 22.685714
O 29.280000
P 24.300000
Q 8.200000
R 24.355556
S 17.066667
T 38.400000
U 20.850000
V 26.000000
W 21.733333
X 12.333333
Y 22.240000
Z 16.080000

Query Dataframe#

The dataframe has the flexibility to filter the data by using query and loc methods. For method loc has been explained partially in above, then we will explain more detail for this case.

Query Method#

We back to use the dataframe of df_combine for this exmple. Let’s say we want to filter the data based on score_1 that has value greater than 20, this can be applied as follow

[ ]:
df_score1_greater_than_20 = df_combine.query('score_1 > 20').reset_index(drop=True)
df_score1_greater_than_20.head()
code score_1 score_2 value_3 value_4
0 Z 26.0 79 2054.0 0.329114
1 C 35.6 59 2100.4 0.603390
2 O 28.4 32 908.8 0.887500
3 H 35.2 88 3097.6 0.400000
4 S 35.2 58 2041.6 0.606897
[ ]:
df_score1_greater_than_20.shape
(51, 5)

we keep the .reset_index(drop=True) method to reset the index and delete the old index feature. Therefore, we have 51 data that has value of score_1 greater than 20.

Sometimes, we need to filter the data based on multiple features criteria. For instance, we filter the data where score_1 is greater than 20 and score_2 is less than 5. Here, we can apply this by using the query method and assign a new dataframe to variable of df_filter_multiple_criteria as follow

[ ]:
df_filter_multiple_criteria = df_combine.query('score_1 > 20 and score_2 < 5').reset_index(drop=True)
df_filter_multiple_criteria.head()
code score_1 score_2 value_3 value_4
0 I 25.2 3 75.6 8.400000
1 O 35.6 2 71.2 17.800000
2 H 24.8 1 24.8 24.800000
3 Y 24.8 3 74.4 8.266667
4 A 98.6 3 NaN NaN
[ ]:
df_filter_multiple_criteria.shape
(6, 5)

Now, we have 6 row data of filtered dataframe.

Loc Method#

We can use loc method to filter the dataframe based on the criteria of the feature. We use the same example as the above

[ ]:
df_score1_greater_than_20_loc = df_combine.loc[df_combine['score_1'] > 20].reset_index(drop=True)
df_score1_greater_than_20_loc.head()
code score_1 score_2 value_3 value_4
0 Z 26.0 79 2054.0 0.329114
1 C 35.6 59 2100.4 0.603390
2 O 28.4 32 908.8 0.887500
3 H 35.2 88 3097.6 0.400000
4 S 35.2 58 2041.6 0.606897

also we can apply for multiple conditions

[ ]:
df_filter_multiple_criteria_loc = df_combine.loc[(df_combine['score_1'] > 20) & (df_combine['score_2'] < 5)].reset_index(drop=True)
df_filter_multiple_criteria_loc.head()
code score_1 score_2 value_3 value_4
0 I 25.2 3 75.6 8.400000
1 O 35.6 2 71.2 17.800000
2 H 24.8 1 24.8 24.800000
3 Y 24.8 3 74.4 8.266667
4 A 98.6 3 NaN NaN

Merging Dataframe#

Pandas has the flexibility method to merge between two dataframes based on a certain feature. It is something like a join table that should have the same key in both dataframes if the table need to be joined. The merge process in dataframe can be done by using .merge() method that has two main attributes, on and how. The attribute on is a key feature that will be used to connect on, and how is the method that should be applied to merge, either inner, outer, left or right.

  • inner: Combining two dataframes that have the overlaped feature values from both dataframes

  • outer: The opposite of inner method. Combining two dataframes that have the non-overlaped feature values from both dataframes

  • left: Combining two dataframes based on the feature value of the left side dataframe

  • right: Combining two dataframes based on the feature value of the right side dataframe

For instance, we create a new dataframe df_3 that consist of 50 row data as follow

[ ]:
# configuring the random seed number
np.random.seed(42)

# Get all lowercase and uppercase letters
alphabets = list(string.ascii_letters.upper())

# creating the data
df_3 = pd.DataFrame({
    'code' : np.random.choice(alphabets, size=50),
    'value_5' : np.random.randint(1, 50, 50) * 0.2
    })
[ ]:
df_3.describe()
value_5
count 50.000000
mean 4.960000
std 2.946045
min 0.200000
25% 2.800000
50% 4.900000
75% 7.800000
max 9.600000

The df_3 will be merged into df_combine. The merging process will be jioned based on code with inner join.

[ ]:
df_merged = df_combine.merge(df_3, on='code', how='inner')
df_merged.head()
code score_1 score_2 value_3 value_4 value_5
0 M 18.0 24 432.0 0.750000 2.8
1 M 18.0 24 432.0 0.750000 1.4
2 M 18.0 24 432.0 0.750000 5.8
3 Z 26.0 79 2054.0 0.329114 1.8
4 Z 26.0 79 2054.0 0.329114 3.6
[ ]:
df_merged.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 245 entries, 0 to 244
Data columns (total 6 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   code     245 non-null    object
 1   score_1  245 non-null    float64
 2   score_2  245 non-null    int64
 3   value_3  235 non-null    float64
 4   value_4  235 non-null    float64
 5   value_5  245 non-null    float64
dtypes: float64(4), int64(1), object(1)
memory usage: 11.6+ KB

As we can see in the dataframe info above. In total, there are 245 data. However, the feature of value_3 and value_4 only have 235 data, which means there are 10 data are missing or NaN. There are several methods to solve this problem, 1) delete all rows that have missing value or NaN by using .dropna() method, 2) replace the missing value or NaN by particular value from a certain calculation or any consideration by using .fillna() method.

[ ]:
df_merged.tail()
code score_1 score_2 value_3 value_4 value_5
240 K 67.4 8 NaN NaN 4.8
241 R 70.0 4 NaN NaN 0.8
242 R 70.0 4 NaN NaN 2.8
243 R 70.0 4 NaN NaN 9.0
244 R 70.0 4 NaN NaN 3.0

Dropna Method#

[ ]:
df_merged_drop = df_merged.dropna()
df_merged_drop.info()
<class 'pandas.core.frame.DataFrame'>
Index: 235 entries, 0 to 234
Data columns (total 6 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   code     235 non-null    object
 1   score_1  235 non-null    float64
 2   score_2  235 non-null    int64
 3   value_3  235 non-null    float64
 4   value_4  235 non-null    float64
 5   value_5  235 non-null    float64
dtypes: float64(4), int64(1), object(1)
memory usage: 12.9+ KB

Fillna Method#

For instance, we can replace the missing value or NaN by 0

[ ]:
df_merged_fillna = df_merged.fillna(0)
df_merged_fillna.tail()
code score_1 score_2 value_3 value_4 value_5
240 K 67.4 8 0.0 0.0 4.8
241 R 70.0 4 0.0 0.0 0.8
242 R 70.0 4 0.0 0.0 2.8
243 R 70.0 4 0.0 0.0 9.0
244 R 70.0 4 0.0 0.0 3.0

Saving to a File Format#

Dataframe can be saved into several types of format file, such as csv, xlsx (excel), json, and other types. The three format files that have been mentioned is the common format file used. Now, we have the df_merged_fillna variable that has a dataframe type.

CSV

We can save this data to the csv format file by using .to_csv() method. This method require two attributes (minimum); path and index. The path is the location or directory where the file will be saved and index is an information that the file will be saved including index number (True) or without index number (False).

[ ]:
df_merged_fillna.to_csv("example_data.csv", index=False)

Excel

[ ]:
df_merged_fillna.to_excel("example_data.xlsx", index=False)