Overview

Dataset statistics

Number of variables: 11
Number of observations: 1025010
Missing cells: 0
Missing cells (%): 0.0%
Duplicate rows: 2239
Duplicate rows (%): 0.2%
Total size in memory: 86.0 MiB
Average record size in memory: 88.0 B
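
The summary figures above can be cross-checked with simple arithmetic. The sketch below is not part of the report; it only re-derives the duplicate percentage and the total memory size from the row count and the average record size listed above. The variable names are illustrative.

```python
# Values copied from the dataset statistics in this report.
n_observations = 1_025_010
n_duplicate_rows = 2_239
avg_record_bytes = 88.0

# Share of duplicate rows, as a percentage of all observations.
duplicate_pct = 100 * n_duplicate_rows / n_observations

# Total size in MiB (1 MiB = 2**20 bytes), assuming the total is simply
# the number of rows times the average record size.
total_mib = n_observations * avg_record_bytes / 2**20

print(f"Duplicate rows (%): {duplicate_pct:.1f}%")
print(f"Total size in memory: {total_mib:.1f} MiB")
```

An average record of 88 B is also consistent with 11 values of 8 bytes each, although the report does not state how the figure was obtained.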

Variable types

NUM: 6
CAT: 5

Reproduction

Analysis started: 2020-08-24 23:49:49.446172
Analysis finished: 2020-08-24 23:50:43.382737
Duration: 53.94 seconds
Version: pandas-profiling v2.8.0
Command line: pandas_profiling --config_file config.yaml [YOUR_FILE.csv]
Download configuration: config.yaml

Warnings

Dataset has 2239 (0.2%) duplicate rows [Duplicates]
target has 513702 (50.1%) zeros [Zeros]

Variables

att_1
Categorical

Distinct count: 4
Unique (%): < 0.1%
Missing: 0
Missing (%): 0.0%
Memory size: 7.8 MiB

Value  Count   Frequency (%)
3      257150  25.1%
1      256087  25.0%
4      256077  25.0%
2      255696  24.9%

Length

Max length: 3
Median length: 3
Mean length: 3
Min length: 3

Overview of Unicode Properties

Unique unicode characters: 6
Unique unicode categories: 2
Unique unicode scripts: 1
Unique unicode blocks: 1
The Unicode Standard assigns character properties to each code point, which can be used to analyse textual variables.
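
As a rough illustration of how such character statistics can arise, the sketch below tallies characters and Unicode general categories from the att_1 value counts. It assumes the values are stored as three-character strings such as "3.0", which is inferred from the reported length of 3 and the characters "." and "0" rather than stated in the report. It uses Python's standard `unicodedata` module and is not the profiler's own implementation.

```python
from collections import Counter
import unicodedata

# att_1 value counts from this report; the string form "3.0" is an assumption.
value_counts = {"3.0": 257150, "1.0": 256087, "4.0": 256077, "2.0": 255696}

# Count every character, weighted by how often its value occurs.
char_counts = Counter()
for value, count in value_counts.items():
    for ch in value:
        char_counts[ch] += count

# Group characters by Unicode general category
# ("Nd" = Decimal Number, "Po" = Other Punctuation).
category_counts = Counter()
for ch, count in char_counts.items():
    category_counts[unicodedata.category(ch)] += count

total_chars = sum(char_counts.values())
for ch, count in char_counts.most_common():
    print(f"{ch!r}: {count} ({100 * count / total_chars:.1f}%)")
for category, count in category_counts.most_common():
    print(f"{category}: {count} ({100 * count / total_chars:.1f}%)")
```

Under that assumption the totals agree with the character tables for att_1: 3075030 characters overall, two thirds of them decimal digits and one third punctuation.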

Most occurring characters

Value  Count    Frequency (%)
.      1025010  33.3%
0      1025010  33.3%
3      257150   8.4%
1      256087   8.3%
4      256077   8.3%
2      255696   8.3%

Most occurring categories

Value              Count    Frequency (%)
Decimal Number     2050020  66.7%
Other Punctuation  1025010  33.3%

Most frequent Decimal Number characters

Value  Count    Frequency (%)
0      1025010  50.0%
3      257150   12.5%
1      256087   12.5%
4      256077   12.5%
2      255696   12.5%

Most frequent Other Punctuation characters

Value  Count    Frequency (%)
.      1025010  100.0%

Most occurring scripts

Value   Count    Frequency (%)
Common  3075030  100.0%

Most frequent Common characters

Value  Count    Frequency (%)
.      1025010  33.3%
0      1025010  33.3%
3      257150   8.4%
1      256087   8.3%
4      256077   8.3%
2      255696   8.3%

Most occurring blocks

Value  Count    Frequency (%)
ASCII  3075030  100.0%

Most frequent ASCII characters

Value  Count    Frequency (%)
.      1025010  33.3%
0      1025010  33.3%
3      257150   8.4%
1      256087   8.3%
4      256077   8.3%
2      255696   8.3%

att_2
Real number (ℝ≥0)

Distinct count: 13
Unique (%): < 0.1%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 6.997861484278202
Minimum: 1.0
Maximum: 13.0
Zeros: 0
Zeros (%): 0.0%
Memory size: 7.8 MiB

Quantile statistics

Minimum: 1
5-th percentile: 1
Q1: 4
Median: 7
Q3: 10
95-th percentile: 13
Maximum: 13
Range: 12
Interquartile range (IQR): 6

Descriptive statistics

Standard deviation: 3.743529466
Coefficient of variation (CV): 0.5349533532
Kurtosis: -1.21531586
Mean: 6.997861484
Median Absolute Deviation (MAD): 3
Skewness: 0.0005632497179
Sum: 7172878
Variance: 14.01401286
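
Several of these descriptive statistics are tied together by simple identities: the mean is the sum divided by the number of observations, the coefficient of variation is the standard deviation divided by the mean, and the variance is the square of the standard deviation. The sketch below checks those relations against the att_2 figures; it is an illustrative cross-check, not the profiler's code.

```python
# Values copied from the att_2 statistics and the dataset overview.
n_observations = 1_025_010
total_sum = 7_172_878
std_dev = 3.743529466

mean = total_sum / n_observations           # Mean = Sum / N
coefficient_of_variation = std_dev / mean   # CV = standard deviation / mean
variance = std_dev ** 2                     # Variance = standard deviation squared

print(f"Mean: {mean:.9f}")
print(f"Coefficient of variation (CV): {coefficient_of_variation:.10f}")
print(f"Variance: {variance:.8f}")
```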
[Figure: Histogram with fixed size bins (bins=10)]

Common values

Value  Count  Frequency (%)
1      79234  7.7%
11     79158  7.7%
6      79142  7.7%
4      79017  7.7%
12     78858  7.7%
13     78833  7.7%
2      78818  7.7%
8      78786  7.7%
5      78769  7.7%
10     78761  7.7%
3      78690  7.7%
7      78542  7.7%
9      78402  7.6%

Minimum 10 values

Value  Count  Frequency (%)
1      79234  7.7%
2      78818  7.7%
3      78690  7.7%
4      79017  7.7%
5      78769  7.7%
6      79142  7.7%
7      78542  7.7%
8      78786  7.7%
9      78402  7.6%
10     78761  7.7%

Maximum 10 values

Value  Count  Frequency (%)
13     78833  7.7%
12     78858  7.7%
11     79158  7.7%
10     78761  7.7%
9      78402  7.6%
8      78786  7.7%
7      78542  7.7%
6      79142  7.7%
5      78769  7.7%
4      79017  7.7%

att_3
Categorical

Distinct count: 4
Unique (%): < 0.1%
Missing: 0
Missing (%): 0.0%
Memory size: 7.8 MiB

Value  Count   Frequency (%)
1      256671  25.0%
4      256535  25.0%
3      255943  25.0%
2      255861  25.0%

Length

Max length: 3
Median length: 3
Mean length: 3
Min length: 3

Overview of Unicode Properties

Unique unicode characters: 6
Unique unicode categories: 2
Unique unicode scripts: 1
Unique unicode blocks: 1
The Unicode Standard assigns character properties to each code point, which can be used to analyse textual variables.

Most occurring characters

Value  Count    Frequency (%)
.      1025010  33.3%
0      1025010  33.3%
1      256671   8.3%
4      256535   8.3%
3      255943   8.3%
2      255861   8.3%

Most occurring categories

Value              Count    Frequency (%)
Decimal Number     2050020  66.7%
Other Punctuation  1025010  33.3%

Most frequent Decimal Number characters

Value  Count    Frequency (%)
0      1025010  50.0%
1      256671   12.5%
4      256535   12.5%
3      255943   12.5%
2      255861   12.5%

Most frequent Other Punctuation characters

Value  Count    Frequency (%)
.      1025010  100.0%

Most occurring scripts

Value   Count    Frequency (%)
Common  3075030  100.0%

Most frequent Common characters

Value  Count    Frequency (%)
.      1025010  33.3%
0      1025010  33.3%
1      256671   8.3%
4      256535   8.3%
3      255943   8.3%
2      255861   8.3%

Most occurring blocks

Value  Count    Frequency (%)
ASCII  3075030  100.0%

Most frequent ASCII characters

Value  Count    Frequency (%)
.      1025010  33.3%
0      1025010  33.3%
1      256671   8.3%
4      256535   8.3%
3      255943   8.3%
2      255861   8.3%

att_4
Real number (ℝ≥0)

Distinct count: 13
Unique (%): < 0.1%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 7.006294572735876
Minimum: 1.0
Maximum: 13.0
Zeros: 0
Zeros (%): 0.0%
Memory size: 7.8 MiB

Quantile statistics

Minimum: 1
5-th percentile: 1
Q1: 4
Median: 7
Q3: 10
95-th percentile: 13
Maximum: 13
Range: 12
Interquartile range (IQR): 6

Descriptive statistics

Standard deviation: 3.744054308
Coefficient of variation (CV): 0.5343843695
Kurtosis: -1.215713979
Mean: 7.006294573
Median Absolute Deviation (MAD): 3
Skewness: -0.001749992216
Sum: 7181522
Variance: 14.01794266
[Figure: Histogram with fixed size bins (bins=10)]

Common values

Value  Count  Frequency (%)
11     79304  7.7%
12     79237  7.7%
13     79206  7.7%
7      79038  7.7%
3      79020  7.7%
6      78968  7.7%
8      78914  7.7%
2      78793  7.7%
1      78738  7.7%
10     78596  7.7%
4      78522  7.7%
9      78367  7.6%
5      78307  7.6%

Minimum 10 values

Value  Count  Frequency (%)
1      78738  7.7%
2      78793  7.7%
3      79020  7.7%
4      78522  7.7%
5      78307  7.6%
6      78968  7.7%
7      79038  7.7%
8      78914  7.7%
9      78367  7.6%
10     78596  7.7%

Maximum 10 values

Value  Count  Frequency (%)
13     79206  7.7%
12     79237  7.7%
11     79304  7.7%
10     78596  7.7%
9      78367  7.6%
8      78914  7.7%
7      79038  7.7%
6      78968  7.7%
5      78307  7.6%
4      78522  7.7%

att_5
Categorical

Distinct count: 4
Unique (%): < 0.1%
Missing: 0
Missing (%): 0.0%
Memory size: 7.8 MiB

Value  Count   Frequency (%)
3      256901  25.1%
4      256531  25.0%
1      256331  25.0%
2      255247  24.9%

Length

Max length: 3
Median length: 3
Mean length: 3
Min length: 3

Overview of Unicode Properties

Unique unicode characters: 6
Unique unicode categories: 2
Unique unicode scripts: 1
Unique unicode blocks: 1
The Unicode Standard assigns character properties to each code point, which can be used to analyse textual variables.

Most occurring characters

Value  Count    Frequency (%)
.      1025010  33.3%
0      1025010  33.3%
3      256901   8.4%
4      256531   8.3%
1      256331   8.3%
2      255247   8.3%

Most occurring categories

Value              Count    Frequency (%)
Decimal Number     2050020  66.7%
Other Punctuation  1025010  33.3%

Most frequent Decimal Number characters

Value  Count    Frequency (%)
0      1025010  50.0%
3      256901   12.5%
4      256531   12.5%
1      256331   12.5%
2      255247   12.5%

Most frequent Other Punctuation characters

Value  Count    Frequency (%)
.      1025010  100.0%

Most occurring scripts

Value   Count    Frequency (%)
Common  3075030  100.0%

Most frequent Common characters

Value  Count    Frequency (%)
.      1025010  33.3%
0      1025010  33.3%
3      256901   8.4%
4      256531   8.3%
1      256331   8.3%
2      255247   8.3%

Most occurring blocks

Value  Count    Frequency (%)
ASCII  3075030  100.0%

Most frequent ASCII characters

Value  Count    Frequency (%)
.      1025010  33.3%
0      1025010  33.3%
3      256901   8.4%
4      256531   8.3%
1      256331   8.3%
2      255247   8.3%

att_6
Real number (ℝ≥0)

Distinct count: 13
Unique (%): < 0.1%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 6.99924586101599
Minimum: 1.0
Maximum: 13.0
Zeros: 0
Zeros (%): 0.0%
Memory size: 7.8 MiB

Quantile statistics

Minimum: 1
5-th percentile: 1
Q1: 4
Median: 7
Q3: 10
95-th percentile: 13
Maximum: 13
Range: 12
Interquartile range (IQR): 6

Descriptive statistics

Standard deviation: 3.74196432
Coefficient of variation (CV): 0.5346239287
Kurtosis: -1.213944154
Mean: 6.999245861
Median Absolute Deviation (MAD): 3
Skewness: -0.0003795025901
Sum: 7174297
Variance: 14.00229697
[Figure: Histogram with fixed size bins (bins=10)]

Common values

Value  Count  Frequency (%)
7      79444  7.8%
10     79073  7.7%
1      79069  7.7%
2      78969  7.7%
8      78905  7.7%
5      78871  7.7%
11     78865  7.7%
12     78855  7.7%
13     78765  7.7%
4      78748  7.7%
6      78587  7.7%
3      78518  7.7%
9      78341  7.6%

Minimum 10 values

Value  Count  Frequency (%)
1      79069  7.7%
2      78969  7.7%
3      78518  7.7%
4      78748  7.7%
5      78871  7.7%
6      78587  7.7%
7      79444  7.8%
8      78905  7.7%
9      78341  7.6%
10     79073  7.7%

Maximum 10 values

Value  Count  Frequency (%)
13     78765  7.7%
12     78855  7.7%
11     78865  7.7%
10     79073  7.7%
9      78341  7.6%
8      78905  7.7%
7      79444  7.8%
6      78587  7.7%
5      78871  7.7%
4      78748  7.7%

att_7
Categorical

Distinct count: 4
Unique (%): < 0.1%
Missing: 0
Missing (%): 0.0%
Memory size: 7.8 MiB

Value  Count   Frequency (%)
3      256914  25.1%
2      256530  25.0%
4      255816  25.0%
1      255750  25.0%

Length

Max length: 3
Median length: 3
Mean length: 3
Min length: 3

Overview of Unicode Properties

Unique unicode characters: 6
Unique unicode categories: 2
Unique unicode scripts: 1
Unique unicode blocks: 1
The Unicode Standard assigns character properties to each code point, which can be used to analyse textual variables.

Most occurring characters

Value  Count    Frequency (%)
.      1025010  33.3%
0      1025010  33.3%
3      256914   8.4%
2      256530   8.3%
4      255816   8.3%
1      255750   8.3%

Most occurring categories

Value              Count    Frequency (%)
Decimal Number     2050020  66.7%
Other Punctuation  1025010  33.3%

Most frequent Decimal Number characters

Value  Count    Frequency (%)
0      1025010  50.0%
3      256914   12.5%
2      256530   12.5%
4      255816   12.5%
1      255750   12.5%

Most frequent Other Punctuation characters

Value  Count    Frequency (%)
.      1025010  100.0%

Most occurring scripts

Value   Count    Frequency (%)
Common  3075030  100.0%

Most frequent Common characters

Value  Count    Frequency (%)
.      1025010  33.3%
0      1025010  33.3%
3      256914   8.4%
2      256530   8.3%
4      255816   8.3%
1      255750   8.3%

Most occurring blocks

Value  Count    Frequency (%)
ASCII  3075030  100.0%

Most frequent ASCII characters

Value  Count    Frequency (%)
.      1025010  33.3%
0      1025010  33.3%
3      256914   8.4%
2      256530   8.3%
4      255816   8.3%
1      255750   8.3%

att_8
Real number (ℝ≥0)

Distinct count: 13
Unique (%): < 0.1%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 7.000838040604482
Minimum: 1.0
Maximum: 13.0
Zeros: 0
Zeros (%): 0.0%
Memory size: 7.8 MiB

Quantile statistics

Minimum: 1
5-th percentile: 1
Q1: 4
Median: 7
Q3: 10
95-th percentile: 13
Maximum: 13
Range: 12
Interquartile range (IQR): 6

Descriptive statistics

Standard deviation: 3.74142301
Coefficient of variation (CV): 0.5344250201
Kurtosis: -1.214147951
Mean: 7.000838041
Median Absolute Deviation (MAD): 3
Skewness: -0.0005056426607
Sum: 7175929
Variance: 13.99824614
[Figure: Histogram with fixed size bins (bins=10)]

Common values

Value  Count  Frequency (%)
7      79524  7.8%
2      79075  7.7%
9      79036  7.7%
11     78895  7.7%
12     78853  7.7%
13     78834  7.7%
4      78790  7.7%
10     78763  7.7%
3      78749  7.7%
1      78717  7.7%
5      78698  7.7%
8      78582  7.7%
6      78494  7.7%

Minimum 10 values

Value  Count  Frequency (%)
1      78717  7.7%
2      79075  7.7%
3      78749  7.7%
4      78790  7.7%
5      78698  7.7%
6      78494  7.7%
7      79524  7.8%
8      78582  7.7%
9      79036  7.7%
10     78763  7.7%

Maximum 10 values

Value  Count  Frequency (%)
13     78834  7.7%
12     78853  7.7%
11     78895  7.7%
10     78763  7.7%
9      79036  7.7%
8      78582  7.7%
7      79524  7.8%
6      78494  7.7%
5      78698  7.7%
4      78790  7.7%

att_9
Categorical

Distinct count: 4
Unique (%): < 0.1%
Missing: 0
Missing (%): 0.0%
Memory size: 7.8 MiB

Value  Count   Frequency (%)
1      257063  25.1%
4      256483  25.0%
3      255986  25.0%
2      255478  24.9%

Length

Max length: 3
Median length: 3
Mean length: 3
Min length: 3

Overview of Unicode Properties

Unique unicode characters: 6
Unique unicode categories: 2
Unique unicode scripts: 1
Unique unicode blocks: 1
The Unicode Standard assigns character properties to each code point, which can be used to analyse textual variables.

Most occurring characters

Value  Count    Frequency (%)
.      1025010  33.3%
0      1025010  33.3%
1      257063   8.4%
4      256483   8.3%
3      255986   8.3%
2      255478   8.3%

Most occurring categories

Value              Count    Frequency (%)
Decimal Number     2050020  66.7%
Other Punctuation  1025010  33.3%

Most frequent Decimal Number characters

Value  Count    Frequency (%)
0      1025010  50.0%
1      257063   12.5%
4      256483   12.5%
3      255986   12.5%
2      255478   12.5%

Most frequent Other Punctuation characters

Value  Count    Frequency (%)
.      1025010  100.0%

Most occurring scripts

Value   Count    Frequency (%)
Common  3075030  100.0%

Most frequent Common characters

Value  Count    Frequency (%)
.      1025010  33.3%
0      1025010  33.3%
1      257063   8.4%
4      256483   8.3%
3      255986   8.3%
2      255478   8.3%

Most occurring blocks

Value  Count    Frequency (%)
ASCII  3075030  100.0%

Most frequent ASCII characters

Value  Count    Frequency (%)
.      1025010  33.3%
0      1025010  33.3%
1      257063   8.4%
4      256483   8.3%
3      255986   8.3%
2      255478   8.3%

att_10
Real number (ℝ≥0)

Distinct count: 13
Unique (%): < 0.1%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 6.98882840167413
Minimum: 1.0
Maximum: 13.0
Zeros: 0
Zeros (%): 0.0%
Memory size: 7.8 MiB

Quantile statistics

Minimum: 1
5-th percentile: 1
Q1: 4
Median: 7
Q3: 10
95-th percentile: 13
Maximum: 13
Range: 12
Interquartile range (IQR): 6

Descriptive statistics

Standard deviation: 3.739935723
Coefficient of variation (CV): 0.5351305695
Kurtosis: -1.214092216
Mean: 6.988828402
Median Absolute Deviation (MAD): 3
Skewness: 0.003976262213
Sum: 7163619
Variance: 13.98711921
[Figure: Histogram with fixed size bins (bins=10)]

Common values

Value  Count  Frequency (%)
3      79480  7.8%
2      79379  7.7%
4      79200  7.7%
9      79200  7.7%
8      79040  7.7%
10     78924  7.7%
5      78787  7.7%
6      78761  7.7%
1      78707  7.7%
7      78515  7.7%
13     78479  7.7%
12     78303  7.6%
11     78235  7.6%

Minimum 10 values

Value  Count  Frequency (%)
1      78707  7.7%
2      79379  7.7%
3      79480  7.8%
4      79200  7.7%
5      78787  7.7%
6      78761  7.7%
7      78515  7.7%
8      79040  7.7%
9      79200  7.7%
10     78924  7.7%

Maximum 10 values

Value  Count  Frequency (%)
13     78479  7.7%
12     78303  7.6%
11     78235  7.6%
10     78924  7.7%
9      79200  7.7%
8      79040  7.7%
7      78515  7.7%
6      78761  7.7%
5      78787  7.7%
4      79200  7.7%

target
Real number (ℝ≥0)

ZEROS

Distinct count: 10
Unique (%): < 0.1%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 0.6170056877493878
Minimum: 0.0
Maximum: 9.0
Zeros: 513702
Zeros (%): 50.1%
Memory size: 7.8 MiB

Quantile statistics

Minimum: 0
5-th percentile: 0
Q1: 0
Median: 0
Q3: 1
95-th percentile: 2
Maximum: 9
Range: 9
Interquartile range (IQR): 1

Descriptive statistics

Standard deviation: 0.7737462783
Coefficient of variation (CV): 1.254034272
Kurtosis: 7.738836538
Mean: 0.6170056877
Median Absolute Deviation (MAD): 0
Skewness: 2.006835438
Sum: 632437
Variance: 0.5986833031
[Figure: Histogram with fixed size bins (bins=10)]

Common values

Value  Count   Frequency (%)
0      513702  50.1%
1      433097  42.3%
2      48828   4.8%
3      21634   2.1%
4      3978    0.4%
5      2050    0.2%
6      1460    0.1%
7      236     < 0.1%
8      17      < 0.1%
9      8       < 0.1%

Minimum 10 values

Value  Count   Frequency (%)
0      513702  50.1%
1      433097  42.3%
2      48828   4.8%
3      21634   2.1%
4      3978    0.4%
5      2050    0.2%
6      1460    0.1%
7      236     < 0.1%
8      17      < 0.1%
9      8       < 0.1%

Maximum 10 values

Value  Count   Frequency (%)
9      8       < 0.1%
8      17      < 0.1%
7      236     < 0.1%
6      1460    0.1%
5      2050    0.2%
4      3978    0.4%
3      21634   2.1%
2      48828   4.8%
1      433097  42.3%
0      513702  50.1%
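
Because target takes only ten distinct values, its summary statistics can be rebuilt from the value counts alone. The sketch below is an illustrative cross-check, not the profiler's implementation; it recomputes the number of observations, the sum, the mean and the share of zeros from the counts listed for target.

```python
# Value counts for target, copied from this report.
target_counts = {
    0: 513702, 1: 433097, 2: 48828, 3: 21634, 4: 3978,
    5: 2050, 6: 1460, 7: 236, 8: 17, 9: 8,
}

n_observations = sum(target_counts.values())
total_sum = sum(value * count for value, count in target_counts.items())
mean = total_sum / n_observations
zeros_pct = 100 * target_counts[0] / n_observations

print(f"Observations: {n_observations}")
print(f"Sum: {total_sum}")
print(f"Mean: {mean:.10f}")
print(f"Zeros (%): {zeros_pct:.1f}%")
```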

Interactions

[Figures: 36 interaction plots (Matplotlib v3.3.1)]

Correlations


Pearson's r

Pearson's correlation coefficient (r) is a measure of linear correlation between two variables. Its value lies between -1 and +1, with -1 indicating total negative linear correlation, 0 indicating no linear correlation, and +1 indicating total positive linear correlation. Furthermore, r is invariant under separate changes in location and (positive) scale of the two variables, implying that for a linear function the angle to the x-axis does not affect r.

To calculate r for two variables X and Y, one divides the covariance of X and Y by the product of their standard deviations.
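
The calculation described above can be sketched in plain Python. This is a minimal illustration using population covariance and standard deviations (the normalisation cancels in the ratio); the function name `pearson_r` and the sample data are made up for the example and do not come from the report.

```python
from math import sqrt

def pearson_r(x, y):
    """Covariance of x and y divided by the product of their standard deviations."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    std_x = sqrt(sum((a - mean_x) ** 2 for a in x) / n)
    std_y = sqrt(sum((b - mean_y) ** 2 for b in y) / n)
    return cov / (std_x * std_y)

# Hypothetical data: y is an exact linear function of x.
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]

r_linear = pearson_r(x, y)
# Shifting and rescaling y by a positive factor leaves r unchanged.
r_shifted = pearson_r(x, [3 * v + 7 for v in y])
# Reversing the direction of y flips the sign.
r_negative = pearson_r(x, [-v for v in y])

print(r_linear, r_shifted, r_negative)
```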

Spearman's ρ

Spearman's rank correlation coefficient (ρ) is a measure of monotonic correlation between two variables, and is therefore better at catching nonlinear monotonic correlations than Pearson's r. Its value lies between -1 and +1, with -1 indicating total negative monotonic correlation, 0 indicating no monotonic correlation, and +1 indicating total positive monotonic correlation.

To calculate ρ for two variables X and Y, one divides the covariance of the rank variables of X and Y by the product of their standard deviations.
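
As a rough sketch of that procedure, the code below converts each variable to ranks (ties receive the average of their positions, a common convention assumed here) and then applies the covariance-over-standard-deviations formula to the ranks. The helper names and the sample data are hypothetical.

```python
from math import sqrt

def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson_r(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    std_x = sqrt(sum((a - mean_x) ** 2 for a in x) / n)
    std_y = sqrt(sum((b - mean_y) ** 2 for b in y) / n)
    return cov / (std_x * std_y)

def spearman_rho(x, y):
    """Pearson's r computed on the rank variables of x and y."""
    return pearson_r(average_ranks(x), average_ranks(y))

# Hypothetical data: y = x**3 is monotonic in x but not linear.
x = [1, 2, 3, 4, 5]
y = [1, 8, 27, 64, 125]

rho = spearman_rho(x, y)
r = pearson_r(x, y)
print(rho, r)
```

On this example ρ reaches 1 because the relation is perfectly monotonic, while Pearson's r stays below 1 because the relation is not linear.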

Kendall's τ

Similarly to Spearman's rank correlation coefficient, the Kendall rank correlation coefficient (τ) measures ordinal association between two variables. Its value lies between -1 and +1, with -1 indicating total negative correlation, 0 indicating no correlation, and +1 indicating total positive correlation.

To calculate τ for two variables X and Y, one determines the number of concordant and discordant pairs of observations. τ is given by the number of concordant pairs minus the discordant pairs divided by the total number of pairs.
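
The pair-counting definition above can be sketched directly. Note the assumption: this is the simple form (often called tau-a), in which tied pairs count as neither concordant nor discordant; statistical libraries commonly use tie-adjusted variants, so results on data with many ties may differ from the values behind this report. The function name and data are illustrative.

```python
from itertools import combinations

def kendall_tau(x, y):
    """(concordant pairs - discordant pairs) / total number of pairs."""
    concordant = 0
    discordant = 0
    pairs = list(combinations(range(len(x)), 2))
    for i, j in pairs:
        # The sign of the product tells whether x and y move in the same direction.
        direction = (x[i] - x[j]) * (y[i] - y[j])
        if direction > 0:
            concordant += 1
        elif direction < 0:
            discordant += 1
    return (concordant - discordant) / len(pairs)

# Hypothetical data: one swapped pair among four observations.
x = [1, 2, 3, 4]
y = [1, 3, 2, 4]

tau = kendall_tau(x, y)
print(tau)  # 5 concordant and 1 discordant pair out of 6
```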

Phik (φk)

Phik (φk) is a practical correlation coefficient that works consistently between categorical, ordinal and interval variables, captures non-linear dependency, and reverts to the Pearson correlation coefficient in the case of a bivariate normal input distribution. Extensive documentation is available from the phik project.

Cramér's V (φc)

Cramér's V is an association measure for nominal random variables. The coefficient ranges from 0 to 1, with 0 indicating independence and 1 indicating perfect association. The empirical estimators used for Cramér's V have been proven to be biased, even for large samples. We use a bias-corrected measure proposed by Bergsma in 2013.
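
For orientation, a minimal sketch of a bias-corrected Cramér's V following Bergsma's 2013 correction is given below. It works on a contingency table supplied as a list of rows and assumes every row and column total is positive. The function name and the example tables are hypothetical, and the profiler's own implementation is not shown in this report and may differ in detail.

```python
from math import sqrt

def cramers_v_corrected(table):
    """Bias-corrected Cramér's V for an r x k contingency table of counts."""
    n = sum(sum(row) for row in table)
    r, k = len(table), len(table[0])
    row_totals = [sum(row) for row in table]
    col_totals = [sum(table[i][j] for i in range(r)) for j in range(k)]

    # Pearson chi-squared statistic against the independence expectation.
    chi2 = 0.0
    for i in range(r):
        for j in range(k):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (table[i][j] - expected) ** 2 / expected

    phi2 = chi2 / n
    # Bias correction: shrink phi^2 and the table dimensions.
    phi2_corr = max(0.0, phi2 - (k - 1) * (r - 1) / (n - 1))
    r_corr = r - (r - 1) ** 2 / (n - 1)
    k_corr = k - (k - 1) ** 2 / (n - 1)
    return sqrt(phi2_corr / min(k_corr - 1, r_corr - 1))

# Hypothetical tables: no association versus perfect association.
v_independent = cramers_v_corrected([[10, 10], [10, 10]])
v_perfect = cramers_v_corrected([[50, 0], [0, 50]])
print(v_independent, v_perfect)
```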

Missing values

[Figures: 2 missing-value plots (Matplotlib v3.3.1)]

Sample

First rows

index    att_1  att_2  att_3  att_4  att_5  att_6  att_7  att_8  att_9  att_10  target
0        1.0    10.0   1.0    11.0   1.0    13.0   1.0    12.0   1.0    1.0     9.0
1        2.0    11.0   2.0    13.0   2.0    10.0   2.0    12.0   2.0    1.0     9.0
2        3.0    12.0   3.0    11.0   3.0    13.0   3.0    10.0   3.0    1.0     9.0
3        4.0    10.0   4.0    11.0   4.0    1.0    4.0    13.0   4.0    12.0    9.0
4        4.0    1.0    4.0    13.0   4.0    12.0   4.0    11.0   4.0    10.0    9.0
5        1.0    2.0    1.0    4.0    1.0    5.0    1.0    3.0    1.0    6.0     8.0
6        1.0    9.0    1.0    12.0   1.0    10.0   1.0    11.0   1.0    13.0    8.0
7        2.0    1.0    2.0    2.0    2.0    3.0    2.0    4.0    2.0    5.0     8.0
8        3.0    5.0    3.0    6.0    3.0    9.0    3.0    7.0    3.0    8.0     8.0
9        4.0    1.0    4.0    4.0    4.0    2.0    4.0    3.0    4.0    5.0     8.0

Last rows

index    att_1  att_2  att_3  att_4  att_5  att_6  att_7  att_8  att_9  att_10  target
1025000  2.0    12.0   4.0    3.0    1.0    3.0    3.0    5.0    3.0    2.0     1.0
1025001  1.0    4.0    4.0    8.0    4.0    5.0    3.0    9.0    2.0    1.0     0.0
1025002  1.0    9.0    3.0    6.0    2.0    8.0    3.0    5.0    2.0    9.0     1.0
1025003  1.0    12.0   3.0    9.0    3.0    6.0    1.0    3.0    1.0    9.0     1.0
1025004  3.0    7.0    1.0    6.0    4.0    12.0   2.0    1.0    1.0    4.0     0.0
1025005  3.0    1.0    1.0    12.0   2.0    9.0    4.0    9.0    2.0    6.0     1.0
1025006  3.0    3.0    4.0    5.0    2.0    7.0    1.0    4.0    4.0    3.0     1.0
1025007  1.0    11.0   4.0    7.0    3.0    9.0    1.0    13.0   2.0    7.0     1.0
1025008  3.0    11.0   1.0    8.0    1.0    1.0    3.0    13.0   2.0    8.0     1.0
1025009  2.0    5.0    2.0    9.0    4.0    9.0    2.0    3.0    3.0    3.0     2.0

Duplicate rows

Most frequent

index  att_1  att_2  att_3  att_4  att_5  att_6  att_7  att_8  att_9  att_10  target  count
410    1.0    10.0   2.0    3.0    3.0    6.0    3.0    12.0   2.0    4.0     0.0     3
911    2.0    9.0    1.0    1.0    2.0    10.0   3.0    4.0    4.0    13.0    0.0     3
2027   4.0    8.0    3.0    8.0    2.0    2.0    4.0    1.0    2.0    7.0     1.0     3
0      1.0    1.0    1.0    3.0    1.0    6.0    2.0    7.0    4.0    12.0    0.0     2
1      1.0    1.0    1.0    3.0    3.0    4.0    4.0    2.0    3.0    10.0    0.0     2
2      1.0    1.0    1.0    10.0   2.0    11.0   2.0    10.0   3.0    2.0     1.0     2
3      1.0    1.0    1.0    12.0   3.0    12.0   4.0    13.0   1.0    11.0    1.0     2
4      1.0    1.0    2.0    1.0    1.0    8.0    4.0    2.0    2.0    8.0     2.0     2
5      1.0    1.0    2.0    1.0    3.0    3.0    3.0    5.0    4.0    9.0     1.0     2
6      1.0    1.0    2.0    2.0    4.0    7.0    3.0    11.0   2.0    8.0     0.0     2