Tutorial Pandas (Python)#
[ ]:
import pandas as pd
import numpy as np
import string
Creating Dataframe#
Dataframe can be created through a dictionary data type that consist of several variables as columns. The format can be written as follow
df = pd.DataFrame({
'column_1' : list of data,
'column_2' : list of data
})
We will create a dataframe with the random data that generated by Numpy library.
[ ]:
# configuring the random seed number
np.random.seed(42)
# Get all lowercase and uppercase letters
alphabets = list(string.ascii_letters.upper())
# creating the data
df_1 = pd.DataFrame({
'code' : np.random.choice(alphabets, size=100),
'value_1' : np.random.randint(1, 100, 100) * 0.4,
'value_2' : np.random.randint(1, 100, 100)
})
[ ]:
# Show the first 5 data
df_1.head()
| code | value_1 | value_2 | |
|---|---|---|---|
| 0 | M | 18.0 | 24 |
| 1 | Z | 26.0 | 79 |
| 2 | C | 35.6 | 59 |
| 3 | O | 28.4 | 32 |
| 4 | Q | 3.6 | 96 |
[ ]:
# show the last 5 data
df_1.tail()
| code | value_1 | value_2 | |
|---|---|---|---|
| 95 | Y | 24.8 | 3 |
| 96 | S | 14.8 | 20 |
| 97 | O | 38.8 | 24 |
| 98 | C | 20.4 | 54 |
| 99 | O | 17.6 | 33 |
The shape method of dataframe will return the dimension of dataframe. The first value indicates the number of rows and the second column indicates the number of columns.
We can also create a dataframe from list data type. It must be configured as multiple lists that describes a column and row as follows
[ ]:
df_2 = pd.DataFrame(
[['A', 98.6, 3],
['D', 76.4, 7],
['K', 67.4, 8],
['R', 70.0, 4]],
columns=['code', 'value_1', 'value_2'])
df_2.head()
| code | value_1 | value_2 | |
|---|---|---|---|
| 0 | A | 98.6 | 3 |
| 1 | D | 76.4 | 7 |
| 2 | K | 67.4 | 8 |
| 3 | R | 70.0 | 4 |
Each value in a list represent as a column data and each list represent as a row.
Summarize Dataframe#
Once we have the dataframe (we could have huge dataframe in the real case), we want to get some general statistical information related to the dataframe. In Pandas, there are some methods that can be used to extract the statistical information as describe below
Describe Method#
Describe method is used to get a general statistic infotmation such as count (number of data), mean, std, min, and max for each feature (column) that has numerical data type. We can use .describe() syntax of a dataframe to show the information as follow
[ ]:
df_1.describe()
| value_1 | value_2 | |
|---|---|---|
| count | 100.000000 | 100.000000 |
| mean | 19.552000 | 48.800000 |
| std | 12.017668 | 31.101658 |
| min | 0.400000 | 1.000000 |
| 25% | 10.800000 | 20.000000 |
| 50% | 18.000000 | 52.000000 |
| 75% | 29.400000 | 73.500000 |
| max | 39.600000 | 99.000000 |
Shape Method#
Shape methos is the simple way to get dataframe dimension only such as number of row and column by using .shape syntax as follow
[ ]:
df_1.shape
(100, 3)
Based on the dataframe of df_1, we get the information that this dataframe has 100 rows and 3 columns.
Info Method#
Info method will give us the information related to data type (int, float, object) and number of data for each column. We can directly know if there are some features that have empty value or data. We can apply this method by using .info() as follow
[ ]:
df_1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 code 100 non-null object
1 value_1 100 non-null float64
2 value_2 100 non-null int64
dtypes: float64(1), int64(1), object(1)
memory usage: 2.5+ KB
Fundamental Statistical Method#
Pandas provide some fundamental statistical method for numerical feature such as sum, mean, min, max, std, var, quantile, median, and count. We can apply this method as follow
Sum Method
[ ]:
df_1_sum = df_1[['value_1', 'value_2']].sum()
df_1_sum
| 0 | |
|---|---|
| value_1 | 1955.2 |
| value_2 | 4880.0 |
Mean Method
[ ]:
df_1_mean = df_1[['value_1', 'value_2']].mean()
df_1_mean
| 0 | |
|---|---|
| value_1 | 19.552 |
| value_2 | 48.800 |
Max Method
[ ]:
df_1_max = df_1[['value_1', 'value_2']].max()
df_1_max
| 0 | |
|---|---|
| value_1 | 39.6 |
| value_2 | 99.0 |
Standar Deviation Method
[ ]:
df_1_std = df_1[['value_1', 'value_2']].std()
df_1_std
| 0 | |
|---|---|
| value_1 | 12.017668 |
| value_2 | 31.101658 |
Variance Method
[ ]:
df_1_var = df_1[['value_1', 'value_2']].var()
df_1_var
| 0 | |
|---|---|
| value_1 | 144.424339 |
| value_2 | 967.313131 |
Quantile Method
We calculate the quantle for 25%, 50%, and 75%.
[ ]:
df_1_quantile = df_1[['value_1', 'value_2']].quantile([0.25, 0.5, 0.75])
df_1_quantile
| value_1 | value_2 | |
|---|---|---|
| 0.25 | 10.8 | 20.0 |
| 0.50 | 18.0 | 52.0 |
| 0.75 | 29.4 | 73.5 |
Median Method
[ ]:
de_1_median = df_1[['value_1', 'value_2']].median()
de_1_median
| 0 | |
|---|---|
| value_1 | 18.0 |
| value_2 | 52.0 |
Value Counts Method#
In particular case, we need to identify a number of categorical data of a feature name. For instance, we have code feature in df_1 and we wonder, how many code of each value in dataframe? We can get this information by using .value_counts() and assign the reult into df_code_count as follow
[ ]:
df_code_count = df_1['code'].value_counts()
df_code_count
| count | |
|---|---|
| code | |
| U | 8 |
| R | 8 |
| N | 7 |
| B | 6 |
| X | 6 |
| Y | 5 |
| O | 5 |
| C | 5 |
| Z | 5 |
| H | 4 |
| P | 4 |
| D | 4 |
| J | 4 |
| G | 3 |
| M | 3 |
| K | 3 |
| S | 3 |
| W | 3 |
| I | 3 |
| V | 3 |
| Q | 2 |
| L | 2 |
| A | 1 |
| T | 1 |
| F | 1 |
| E | 1 |
Unique Method and Nunique Method#
The unique method is used to identify the unique value in a feature name and nunique method is used to counted a total number of unique value if a feature name
[ ]:
df_unique = df_1['code'].unique()
df_unique
array(['M', 'Z', 'C', 'O', 'Q', 'H', 'U', 'S', 'W', 'K', 'X', 'J', 'N',
'V', 'B', 'R', 'D', 'L', 'G', 'Y', 'A', 'P', 'I', 'T', 'F', 'E'],
dtype=object)
[ ]:
df_nunique = df_1['code'].nunique()
df_nunique
26
Making a New Column#
In the middle of data anlysis, we will often to add a new column in a existing dataframe. For instance, the new column is generated from a certain calculation. Now, we have dataframe of df_1 that consists of features of code, value_1 and value_2. We want to add new feature (column) called value_3. This column is the multiplication between value_1 and value_2. It can be done by using Assign method and Manual method.
Assign Method#
We use .assign() method to add a new column to the current dataframe as follow
[ ]:
df_1 = df_1.assign(value_3 = df_1['value_1'] * df_1['value_2'])
df_1.head()
| code | value_1 | value_2 | value_3 | |
|---|---|---|---|---|
| 0 | M | 18.0 | 24 | 432.0 |
| 1 | Z | 26.0 | 79 | 2054.0 |
| 2 | C | 35.6 | 59 | 2100.4 |
| 3 | O | 28.4 | 32 | 908.8 |
| 4 | Q | 3.6 | 96 | 345.6 |
Manual Method#
In manual method, we can add a new column by defining the new name of column in dataframe variable and assign new value into it as follow
df_1['new_col_name'] = value
we create a new column name as value_4 that apply division between value_1 and value_2.
[ ]:
# df_1['value_4'] = df_1['value_1'] / df_1['value_2'] # first type
df_1['value_4'] = df_1.value_1 / df_1.value_2 # second type
df_1.head()
| code | value_1 | value_2 | value_3 | value_4 | |
|---|---|---|---|---|---|
| 0 | M | 18.0 | 24 | 432.0 | 0.750000 |
| 1 | Z | 26.0 | 79 | 2054.0 | 0.329114 |
| 2 | C | 35.6 | 59 | 2100.4 | 0.603390 |
| 3 | O | 28.4 | 32 | 908.8 | 0.887500 |
| 4 | Q | 3.6 | 96 | 345.6 | 0.037500 |
Reshaping and Organizing Dataframe#
Concat#
We already have two datasets of dataframe, df_1 and df_2. We can combine the second dataframe into the first dataframe by using this method. It will be extended from the last row of the first dataframe. We will save the new dataframe into df_combine
[ ]:
df_combine = pd.concat([df_1, df_2])
df_combine.tail() # checking the updated dataframe at the last 5 data
| code | value_1 | value_2 | value_3 | value_4 | |
|---|---|---|---|---|---|
| 99 | O | 17.6 | 33 | 580.8 | 0.533333 |
| 0 | A | 98.6 | 3 | NaN | NaN |
| 1 | D | 76.4 | 7 | NaN | NaN |
| 2 | K | 67.4 | 8 | NaN | NaN |
| 3 | R | 70.0 | 4 | NaN | NaN |
Reset Index#
As we can see, the dataframe of df_2 has been added into the dataframe of df_1. However, the index number in the new dataframe of df_combine still follows the origin dataframe format, where the df_1 has the index start from 0 - 99 and the df_2 has the index start from 0 - 3. We can readjust the index number using reset_index method. We assign the attribute of inplace=True in the reset_index method to overide the data change of dataframe with the same name
variable. If we do not write the attribute of inplace=True, the change will not be save unless you assign into new variable name.
[ ]:
df_combine.reset_index(inplace=True)
df_combine.tail()
| index | code | value_1 | value_2 | value_3 | value_4 | |
|---|---|---|---|---|---|---|
| 99 | 99 | O | 17.6 | 33 | 580.8 | 0.533333 |
| 100 | 0 | A | 98.6 | 3 | NaN | NaN |
| 101 | 1 | D | 76.4 | 7 | NaN | NaN |
| 102 | 2 | K | 67.4 | 8 | NaN | NaN |
| 103 | 3 | R | 70.0 | 4 | NaN | NaN |
Drop#
We have just applied the reset_index method for the dataframe as shown in dataframe above. We can see that there is the old index column that has been added into the dataframe. In this case, the old index column is not required, We then can drop the old index column by using drop method with some attributes axis=1 and inplace=True. The axis=1 indicates the data is in the column
[ ]:
df_combine.drop(columns='index', axis=1, inplace=True)
df_combine.tail()
| code | value_1 | value_2 | value_3 | value_4 | |
|---|---|---|---|---|---|
| 99 | O | 17.6 | 33 | 580.8 | 0.533333 |
| 100 | A | 98.6 | 3 | NaN | NaN |
| 101 | D | 76.4 | 7 | NaN | NaN |
| 102 | K | 67.4 | 8 | NaN | NaN |
| 103 | R | 70.0 | 4 | NaN | NaN |
If we want to delete some columns, we can add multiple column names into the method as follow
df_combine.drop(columns=['col_name_1','col_name_2'], axis=1, inplace=True)
Rename#
We can rename the column name by using rename method. For instance, we want to rename value_1 to score_1 and value_2 to score_2
[ ]:
df_combine.rename(columns={'value_1' : 'score_1', 'value_2' : 'score_2'}, inplace=True)
df_combine.head()
| code | score_1 | score_2 | value_3 | value_4 | |
|---|---|---|---|---|---|
| 0 | M | 18.0 | 24 | 432.0 | 0.750000 |
| 1 | Z | 26.0 | 79 | 2054.0 | 0.329114 |
| 2 | C | 35.6 | 59 | 2100.4 | 0.603390 |
| 3 | O | 28.4 | 32 | 908.8 | 0.887500 |
| 4 | Q | 3.6 | 96 | 345.6 | 0.037500 |
Sorting Data#
Sometimes, we need to sort the data either ascendingly (lower to higher) or descendingly (higher to lower). In dataframe format, we can done this by using sort_values method with attribute name ascending. The attribute name of acending will take True for ascending and False for descending. By default, the attibute will be True. For instance, we can sort the data based on score_1 ascendingly and save into new datafrmae of df_combine_ascending.
[ ]:
df_combine_ascending = df_combine.sort_values(by='score_1', ascending=True).reset_index(drop=True)
df_combine_ascending.head()
| code | score_1 | score_2 | value_3 | value_4 | |
|---|---|---|---|---|---|
| 0 | U | 0.4 | 52 | 20.8 | 0.007692 |
| 1 | C | 0.4 | 69 | 27.6 | 0.005797 |
| 2 | W | 0.4 | 12 | 4.8 | 0.033333 |
| 3 | Z | 1.2 | 24 | 28.8 | 0.050000 |
| 4 | S | 1.2 | 9 | 10.8 | 0.133333 |
The score_1 data has been sorted ascendingly. You may notice that in the code above, we add reset_index method with attribute of drop=True. This indicates to reset the index number to the right order and drop the old index column directly through this command.
Now, we try to sort the score_2 descendingly and save it into new variable name of df_combine_descending.
[ ]:
df_combine_descending = df_combine.sort_values(by='score_2', ascending=False).reset_index(drop=True)
df_combine_descending.head()
| code | score_1 | score_2 | value_3 | value_4 | |
|---|---|---|---|---|---|
| 0 | J | 33.6 | 99 | 3326.4 | 0.339394 |
| 1 | R | 28.0 | 99 | 2772.0 | 0.282828 |
| 2 | K | 2.0 | 98 | 196.0 | 0.020408 |
| 3 | B | 13.6 | 97 | 1319.2 | 0.140206 |
| 4 | M | 31.6 | 97 | 3065.2 | 0.325773 |
Groupby#
Sometime, we want to collective the data based on the same category from a certain column and calculate its aggregate (e.g., mean, sum, etc). We can done this by using groupby method of dataframe. For instance, we want to grouping the data based on the code in the df_combine as follow
[ ]:
df_groupby = df_combine.groupby(by='code').mean()
df_groupby.head()
| score_1 | score_2 | value_3 | value_4 | |
|---|---|---|---|---|
| code | ||||
| A | 58.100000 | 36.500000 | 1232.000000 | 0.251429 |
| B | 18.333333 | 42.666667 | 894.133333 | 1.518709 |
| C | 16.240000 | 58.200000 | 897.520000 | 0.295373 |
| D | 35.440000 | 48.600000 | 1712.000000 | 3.104964 |
| E | 30.000000 | 90.000000 | 2700.000000 | 0.333333 |
Based on the code above, we have applied groupby method with aggregate of mean for all columns. This result gives the result where the code column as the index.
In case you want to calculate different aggregate method for specific column, we can apply .agg() method along with groupby method. For example, the score_1 will be aggregated by its mean and score_2 will be aggregated by its sum for each grouped code and save it into df_groupby_multi_agg.
[ ]:
df_groupby_multi_agg = df_combine.groupby(by='code').agg({'score_1':'mean', 'score_2':'sum'})
df_groupby_multi_agg.head()
| score_1 | score_2 | |
|---|---|---|
| code | ||
| A | 58.100000 | 73 |
| B | 18.333333 | 256 |
| C | 16.240000 | 291 |
| D | 35.440000 | 243 |
| E | 30.000000 | 90 |
After grouping, we can check the dimension of new dataframe of df_groupby_multi_agg that has been reshaped, which is consist of 26 rows and 2 columns, where the code is the index.
[ ]:
df_groupby_multi_agg.shape
(26, 2)
Selecting Feature(s)#
We can create a new dataframe from selecting certain feature name(s). For instance, we have df_groupby_multi_agg that has feature’s name of score_1 and score_2. We can select for score_1 only as follow
[ ]:
df_score1 = df_groupby_multi_agg['score_1']
df_score1.head()
| score_1 | |
|---|---|
| code | |
| A | 58.100000 |
| B | 18.333333 |
| C | 16.240000 |
| D | 35.440000 |
| E | 30.000000 |
There is another method to get a single feature name by calling the it as a method
dataframe.feature_name
We can see the example as follow
[ ]:
df_score1_method2 = df_groupby_multi_agg.score_1
df_score1_method2.head()
| score_1 | |
|---|---|
| code | |
| A | 58.100000 |
| B | 18.333333 |
| C | 16.240000 |
| D | 35.440000 |
| E | 30.000000 |
Sometimes, we want to get more than one feature’s name. We can write the freature’s names that we want to extract within a list as follow
df_new = dataframe.[['feature1', 'feature2', 'feature3']]
and assign to new variable of df_new for instance.
We can also access the row data of dataframe by using index (default is number). To do this, we can use .iloc for the index nuumber or .loc for the non index number. For instance, we want to get the value of score_1 that belongs to C.
[ ]:
df_score1.loc['C']
np.float64(16.240000000000002)
If we use .iloc['C'] for the above case, it will rise an error because the index is not a number as follow
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
/tmp/ipython-input-1732007559.py in <cell line: 0>()
----> 1 df_score1.iloc['C']
1 frames
/usr/local/lib/python3.12/dist-packages/pandas/core/indexing.py in _getitem_axis(self, key, axis)
1747 key = item_from_zerodim(key)
1748 if not is_integer(key):
-> 1749 raise TypeError("Cannot index by location index with a non-integer key")
1750
1751 # validate the location
TypeError: Cannot index by location index with a non-integer key
We can use the integer number that corresponding to the C which is equal 2
[ ]:
df_score1.iloc[2]
np.float64(16.240000000000002)
We also can access for certain dataframe based on the range number of row by using .iloc[start:end]. We apply this method for dataframe of df_score1 to get a new dataframe from row 10 to 15.
[ ]:
df_score1_get_row = df_score1.iloc[10:15]
df_score1_get_row
| score_1 | |
|---|---|
| code | |
| K | 26.550000 |
| L | 7.000000 |
| M | 17.600000 |
| N | 22.685714 |
| O | 29.280000 |
If we want to extract the new dataframe just from certain row index number and get all the data from the rest of rows, we can apply the index by using .iloc[10:]. The [10:] means that the data will be extracted from row of 10 to the end of original dataframe row.
[ ]:
df_score1_get_row = df_score1.iloc[10:]
df_score1_get_row
| score_1 | |
|---|---|
| code | |
| K | 26.550000 |
| L | 7.000000 |
| M | 17.600000 |
| N | 22.685714 |
| O | 29.280000 |
| P | 24.300000 |
| Q | 8.200000 |
| R | 24.355556 |
| S | 17.066667 |
| T | 38.400000 |
| U | 20.850000 |
| V | 26.000000 |
| W | 21.733333 |
| X | 12.333333 |
| Y | 22.240000 |
| Z | 16.080000 |
Query Dataframe#
The dataframe has the flexibility to filter the data by using query and loc methods. For method loc has been explained partially in above, then we will explain more detail for this case.
Query Method#
We back to use the dataframe of df_combine for this exmple. Let’s say we want to filter the data based on score_1 that has value greater than 20, this can be applied as follow
[ ]:
df_score1_greater_than_20 = df_combine.query('score_1 > 20').reset_index(drop=True)
df_score1_greater_than_20.head()
| code | score_1 | score_2 | value_3 | value_4 | |
|---|---|---|---|---|---|
| 0 | Z | 26.0 | 79 | 2054.0 | 0.329114 |
| 1 | C | 35.6 | 59 | 2100.4 | 0.603390 |
| 2 | O | 28.4 | 32 | 908.8 | 0.887500 |
| 3 | H | 35.2 | 88 | 3097.6 | 0.400000 |
| 4 | S | 35.2 | 58 | 2041.6 | 0.606897 |
[ ]:
df_score1_greater_than_20.shape
(51, 5)
we keep the .reset_index(drop=True) method to reset the index and delete the old index feature. Therefore, we have 51 data that has value of score_1 greater than 20.
Sometimes, we need to filter the data based on multiple features criteria. For instance, we filter the data where score_1 is greater than 20 and score_2 is less than 5. Here, we can apply this by using the query method and assign a new dataframe to variable of df_filter_multiple_criteria as follow
[ ]:
df_filter_multiple_criteria = df_combine.query('score_1 > 20 and score_2 < 5').reset_index(drop=True)
df_filter_multiple_criteria.head()
| code | score_1 | score_2 | value_3 | value_4 | |
|---|---|---|---|---|---|
| 0 | I | 25.2 | 3 | 75.6 | 8.400000 |
| 1 | O | 35.6 | 2 | 71.2 | 17.800000 |
| 2 | H | 24.8 | 1 | 24.8 | 24.800000 |
| 3 | Y | 24.8 | 3 | 74.4 | 8.266667 |
| 4 | A | 98.6 | 3 | NaN | NaN |
[ ]:
df_filter_multiple_criteria.shape
(6, 5)
Now, we have 6 row data of filtered dataframe.
Loc Method#
We can use loc method to filter the dataframe based on the criteria of the feature. We use the same example as the above
[ ]:
df_score1_greater_than_20_loc = df_combine.loc[df_combine['score_1'] > 20].reset_index(drop=True)
df_score1_greater_than_20_loc.head()
| code | score_1 | score_2 | value_3 | value_4 | |
|---|---|---|---|---|---|
| 0 | Z | 26.0 | 79 | 2054.0 | 0.329114 |
| 1 | C | 35.6 | 59 | 2100.4 | 0.603390 |
| 2 | O | 28.4 | 32 | 908.8 | 0.887500 |
| 3 | H | 35.2 | 88 | 3097.6 | 0.400000 |
| 4 | S | 35.2 | 58 | 2041.6 | 0.606897 |
also we can apply for multiple conditions
[ ]:
df_filter_multiple_criteria_loc = df_combine.loc[(df_combine['score_1'] > 20) & (df_combine['score_2'] < 5)].reset_index(drop=True)
df_filter_multiple_criteria_loc.head()
| code | score_1 | score_2 | value_3 | value_4 | |
|---|---|---|---|---|---|
| 0 | I | 25.2 | 3 | 75.6 | 8.400000 |
| 1 | O | 35.6 | 2 | 71.2 | 17.800000 |
| 2 | H | 24.8 | 1 | 24.8 | 24.800000 |
| 3 | Y | 24.8 | 3 | 74.4 | 8.266667 |
| 4 | A | 98.6 | 3 | NaN | NaN |
Merging Dataframe#
Pandas has the flexibility method to merge between two dataframes based on a certain feature. It is something like a join table that should have the same key in both dataframes if the table need to be joined. The merge process in dataframe can be done by using .merge() method that has two main attributes, on and how. The attribute on is a key feature that will be used to connect on, and how is the method that should be applied to merge, either inner, outer, left
or right.
inner: Combining two dataframes that have the overlaped feature values from both dataframes
outer: The opposite of inner method. Combining two dataframes that have the non-overlaped feature values from both dataframes
left: Combining two dataframes based on the feature value of the left side dataframe
right: Combining two dataframes based on the feature value of the right side dataframe
For instance, we create a new dataframe df_3 that consist of 50 row data as follow
[ ]:
# configuring the random seed number
np.random.seed(42)
# Get all lowercase and uppercase letters
alphabets = list(string.ascii_letters.upper())
# creating the data
df_3 = pd.DataFrame({
'code' : np.random.choice(alphabets, size=50),
'value_5' : np.random.randint(1, 50, 50) * 0.2
})
[ ]:
df_3.describe()
| value_5 | |
|---|---|
| count | 50.000000 |
| mean | 4.960000 |
| std | 2.946045 |
| min | 0.200000 |
| 25% | 2.800000 |
| 50% | 4.900000 |
| 75% | 7.800000 |
| max | 9.600000 |
The df_3 will be merged into df_combine. The merging process will be jioned based on code with inner join.
[ ]:
df_merged = df_combine.merge(df_3, on='code', how='inner')
df_merged.head()
| code | score_1 | score_2 | value_3 | value_4 | value_5 | |
|---|---|---|---|---|---|---|
| 0 | M | 18.0 | 24 | 432.0 | 0.750000 | 2.8 |
| 1 | M | 18.0 | 24 | 432.0 | 0.750000 | 1.4 |
| 2 | M | 18.0 | 24 | 432.0 | 0.750000 | 5.8 |
| 3 | Z | 26.0 | 79 | 2054.0 | 0.329114 | 1.8 |
| 4 | Z | 26.0 | 79 | 2054.0 | 0.329114 | 3.6 |
[ ]:
df_merged.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 245 entries, 0 to 244
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 code 245 non-null object
1 score_1 245 non-null float64
2 score_2 245 non-null int64
3 value_3 235 non-null float64
4 value_4 235 non-null float64
5 value_5 245 non-null float64
dtypes: float64(4), int64(1), object(1)
memory usage: 11.6+ KB
As we can see in the dataframe info above. In total, there are 245 data. However, the feature of value_3 and value_4 only have 235 data, which means there are 10 data are missing or NaN. There are several methods to solve this problem, 1) delete all rows that have missing value or NaN by using .dropna() method, 2) replace the missing value or NaN by particular value from a certain calculation or any consideration by using .fillna() method.
[ ]:
df_merged.tail()
| code | score_1 | score_2 | value_3 | value_4 | value_5 | |
|---|---|---|---|---|---|---|
| 240 | K | 67.4 | 8 | NaN | NaN | 4.8 |
| 241 | R | 70.0 | 4 | NaN | NaN | 0.8 |
| 242 | R | 70.0 | 4 | NaN | NaN | 2.8 |
| 243 | R | 70.0 | 4 | NaN | NaN | 9.0 |
| 244 | R | 70.0 | 4 | NaN | NaN | 3.0 |
Dropna Method#
[ ]:
df_merged_drop = df_merged.dropna()
df_merged_drop.info()
<class 'pandas.core.frame.DataFrame'>
Index: 235 entries, 0 to 234
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 code 235 non-null object
1 score_1 235 non-null float64
2 score_2 235 non-null int64
3 value_3 235 non-null float64
4 value_4 235 non-null float64
5 value_5 235 non-null float64
dtypes: float64(4), int64(1), object(1)
memory usage: 12.9+ KB
Fillna Method#
For instance, we can replace the missing value or NaN by 0
[ ]:
df_merged_fillna = df_merged.fillna(0)
df_merged_fillna.tail()
| code | score_1 | score_2 | value_3 | value_4 | value_5 | |
|---|---|---|---|---|---|---|
| 240 | K | 67.4 | 8 | 0.0 | 0.0 | 4.8 |
| 241 | R | 70.0 | 4 | 0.0 | 0.0 | 0.8 |
| 242 | R | 70.0 | 4 | 0.0 | 0.0 | 2.8 |
| 243 | R | 70.0 | 4 | 0.0 | 0.0 | 9.0 |
| 244 | R | 70.0 | 4 | 0.0 | 0.0 | 3.0 |
Saving to a File Format#
Dataframe can be saved into several types of format file, such as csv, xlsx (excel), json, and other types. The three format files that have been mentioned is the common format file used. Now, we have the df_merged_fillna variable that has a dataframe type.
CSV
We can save this data to the csv format file by using .to_csv() method. This method require two attributes (minimum); path and index. The path is the location or directory where the file will be saved and index is an information that the file will be saved including index number (True) or without index number (False).
[ ]:
df_merged_fillna.to_csv("example_data.csv", index=False)
Excel
[ ]:
df_merged_fillna.to_excel("example_data.xlsx", index=False)