Overview

Dataset statistics

Number of variables             10
Number of observations          116640
Missing cells                   0
Missing cells (%)               0.0%
Duplicate rows                  0
Duplicate rows (%)              0.0%
Total size in memory            8.9 MiB
Average record size in memory   80.0 B
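
The memory figures above are consistent with each other: the number of observations times the average record size gives the total size. A minimal sketch of that arithmetic follows. The 8-bytes-per-value split across the 10 variables is an assumption for illustration, not something the report states.

```python
# Figures copied from the "Dataset statistics" table.
n_observations = 116640
n_variables = 10
avg_record_bytes = 80.0  # "Average record size in memory"

# Total size = number of records x average record size.
total_bytes = n_observations * avg_record_bytes
total_mib = total_bytes / 2**20  # 1 MiB = 1,048,576 bytes

# Assumption: every variable is stored as a fixed-width value of equal size.
bytes_per_value = avg_record_bytes / n_variables

print(f"total: {total_mib:.1f} MiB, per value: {bytes_per_value:.0f} B")
```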

Variable types

BOOL   4
NUM    4
CAT    2

Reproduction

Analysis started         2020-08-24 23:49:16.714664
Analysis finished        2020-08-24 23:49:23.510658
Duration                 6.8 seconds
Version                  pandas-profiling v2.8.0
Command line             pandas_profiling --config_file config.yaml [YOUR_FILE.csv]
Download configuration   config.yaml

Warnings

inv-nodes has 52584 (45.1%) zeros
breast-quad has 43116 (37.0%) zeros
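
The percentages in these warnings are the zero counts divided by the number of observations. A small sketch that reproduces them from the counts reported here (the variable names in the code are illustrative only):

```python
# Counts copied from the report.
n_observations = 116640
zero_counts = {"inv-nodes": 52584, "breast-quad": 43116}

# Share of zeros per variable, in percent, rounded to one decimal as in the report.
zero_share = {
    name: round(100 * count / n_observations, 1)
    for name, count in zero_counts.items()
}

for name, share in zero_share.items():
    print(f"{name}: {share}% zeros")
```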

Variables

age
Real number (ℝ≥0)

Distinct count   115774
Unique (%)       99.3%
Missing          0
Missing (%)      0.0%
Infinite         0
Infinite (%)     0.0%
Mean             50.887072935451044
Minimum          22.70721244812012
Maximum          75.94065856933594
Zeros            0
Zeros (%)        0.0%
Memory size      911.4 KiB

Quantile statistics

Minimum                     22.70721245
5-th percentile             34.60799923
Q1                          44.58634472
median                      50.62482643
Q3                          57.25482273
95-th percentile            66.78811798
Maximum                     75.94065857
Range                       53.23344612
Interquartile range (IQR)   12.66847801

Descriptive statistics

Standard deviation                9.674236425
Coefficient of variation (CV)     0.1901118667
Kurtosis                          -0.6643299387
Mean                              50.88707294
Median Absolute Deviation (MAD)   6.268323898
Skewness                          -0.007419632656
Sum                               5935468.187
Variance                          93.5908504
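
Several of the statistics for age are derived from others in the same tables. The sketch below recomputes them from the rounded values printed in this report, so small differences in the last digits are expected. It uses the standard definitions (CV = standard deviation / mean, variance = standard deviation squared, IQR = Q3 - Q1, range = maximum - minimum).

```python
# Rounded values copied from the age tables.
std = 9.674236425
mean = 50.88707294
q1, q3 = 44.58634472, 57.25482273
minimum, maximum = 22.70721245, 75.94065857

cv = std / mean              # coefficient of variation
variance = std ** 2          # variance is the squared standard deviation
iqr = q3 - q1                # interquartile range
value_range = maximum - minimum

print(f"CV={cv:.10f} variance={variance:.7f} IQR={iqr:.8f} range={value_range:.8f}")
```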
Histogram with fixed size bins (bins=10)

Common values

Value                   Count    Frequency (%)
44.2289505              3        < 0.1%
45.38502121             2        < 0.1%
51.22075653             2        < 0.1%
50.48575592             2        < 0.1%
46.56327057             2        < 0.1%
46.40901566             2        < 0.1%
50.15187454             2        < 0.1%
46.05487061             2        < 0.1%
53.80262375             2        < 0.1%
65.60231781             2        < 0.1%
49.03860474             2        < 0.1%
68.13941193             2        < 0.1%
65.03392792             2        < 0.1%
48.84786987             2        < 0.1%
52.09459686             2        < 0.1%
65.63278961             2        < 0.1%
45.15192795             2        < 0.1%
52.36730957             2        < 0.1%
51.70977402             2        < 0.1%
50.97578049             2        < 0.1%
44.80986023             2        < 0.1%
46.98396683             2        < 0.1%
64.81917572             2        < 0.1%
36.38105011             2        < 0.1%
45.23083115             2        < 0.1%
Other values (115749)   116589   > 99.9%

Minimum values

Value         Count   Frequency (%)
22.70721245   1       < 0.1%
23.76066589   1       < 0.1%
24.17514229   1       < 0.1%
24.30625343   1       < 0.1%
24.70094872   1       < 0.1%
24.7590332    1       < 0.1%
25.09537506   1       < 0.1%
25.19407463   1       < 0.1%
25.22072983   1       < 0.1%
25.3105545    1       < 0.1%

Maximum values

Value         Count   Frequency (%)
75.94065857   1       < 0.1%
75.76948547   1       < 0.1%
75.54802704   1       < 0.1%
75.34172058   1       < 0.1%
75.17150879   1       < 0.1%
75.16776276   1       < 0.1%
75.07889557   1       < 0.1%
75.04499054   1       < 0.1%
75.04186249   1       < 0.1%
74.97699738   1       < 0.1%

menopause
Categorical

Distinct count   3
Unique (%)       < 0.1%
Missing          0
Missing (%)      0.0%
Memory size      911.4 KiB

Value   Count   Frequency (%)
0       58516   50.2%
1       50847   43.6%
2       7277    6.2%

Length

Max length      1
Median length   1
Mean length     1
Min length      1

Overview of Unicode Properties

Unique unicode characters   3
Unique unicode categories   1
Unique unicode scripts      1
Unique unicode blocks       1
The Unicode Standard assigns character properties to each code point, which can be used to analyse textual variables.

Most occurring characters

Value   Count   Frequency (%)
0       58516   50.2%
1       50847   43.6%
2       7277    6.2%

Most occurring categories

Value            Count    Frequency (%)
Decimal Number   116640   100.0%

Most frequent Decimal Number characters

Value   Count   Frequency (%)
0       58516   50.2%
1       50847   43.6%
2       7277    6.2%

Most occurring scripts

Value    Count    Frequency (%)
Common   116640   100.0%

Most frequent Common characters

Value   Count   Frequency (%)
0       58516   50.2%
1       50847   43.6%
2       7277    6.2%

Most occurring blocks

Value   Count    Frequency (%)
ASCII   116640   100.0%

Most frequent ASCII characters

Value   Count   Frequency (%)
0       58516   50.2%
1       50847   43.6%
2       7277    6.2%

inv-nodes
Real number (ℝ≥0)

ZEROS

Distinct count   18
Unique (%)       < 0.1%
Missing          0
Missing (%)      0.0%
Infinite         0
Infinite (%)     0.0%
Mean             3.1407150205761316
Minimum          0.0
Maximum          17.0
Zeros            52584
Zeros (%)        45.1%
Memory size      911.4 KiB

Quantile statistics

Minimum                     0
5-th percentile             0
Q1                          0
median                      1
Q3                          4
95-th percentile            14
Maximum                     17
Range                       17
Interquartile range (IQR)   4

Descriptive statistics

Standard deviation                4.42793718
Coefficient of variation (CV)     1.409850034
Kurtosis                          1.43983298
Mean                              3.140715021
Median Absolute Deviation (MAD)   1
Skewness                          1.542442221
Sum                               366333
Variance                          19.60662767
Histogram with fixed size bins (bins=10)

Common values

Value   Count   Frequency (%)
0       52584   45.1%
3       14168   12.1%
1       10434   8.9%
2       7888    6.8%
7       5180    4.4%
10      3167    2.7%
4       2955    2.5%
8       2846    2.4%
9       2796    2.4%
5       2526    2.2%
17      1830    1.6%
14      1615    1.4%
6       1595    1.4%
12      1543    1.3%
15      1540    1.3%
13      1454    1.2%
11      1361    1.2%
16      1158    1.0%

Minimum values

Value   Count   Frequency (%)
0       52584   45.1%
1       10434   8.9%
2       7888    6.8%
3       14168   12.1%
4       2955    2.5%
5       2526    2.2%
6       1595    1.4%
7       5180    4.4%
8       2846    2.4%
9       2796    2.4%

Maximum values

Value   Count   Frequency (%)
17      1830    1.6%
16      1158    1.0%
15      1540    1.3%
14      1615    1.4%
13      1454    1.2%
12      1543    1.3%
11      1361    1.2%
10      3167    2.7%
9       2796    2.4%
8       2846    2.4%

node-caps
Boolean

Distinct count   2
Unique (%)       < 0.1%
Missing          0
Missing (%)      0.0%
Memory size      911.4 KiB

Value   Count   Frequency (%)
0       88561   75.9%
1       28079   24.1%

deg-malig
Categorical

Distinct count   3
Unique (%)       < 0.1%
Missing          0
Missing (%)      0.0%
Memory size      911.4 KiB

Value   Count   Frequency (%)
2       52605   45.1%
1       36457   31.3%
0       27578   23.6%

Length

Max length      3
Median length   3
Mean length     3
Min length      3

Overview of Unicode Properties

Unique unicode characters   4
Unique unicode categories   2
Unique unicode scripts      1
Unique unicode blocks       1
The Unicode Standard assigns character properties to each code point, which can be used to analyse textual variables.

Most occurring characters

Value   Count    Frequency (%)
0       144218   41.2%
.       116640   33.3%
2       52605    15.0%
1       36457    10.4%

Most occurring categories

Value               Count    Frequency (%)
Decimal Number      233280   66.7%
Other Punctuation   116640   33.3%

Most frequent Decimal Number characters

Value   Count    Frequency (%)
0       144218   61.8%
2       52605    22.6%
1       36457    15.6%

Most frequent Other Punctuation characters

Value   Count    Frequency (%)
.       116640   100.0%

Most occurring scripts

Value    Count    Frequency (%)
Common   349920   100.0%

Most frequent Common characters

Value   Count    Frequency (%)
0       144218   41.2%
.       116640   33.3%
2       52605    15.0%
1       36457    10.4%

Most occurring blocks

Value   Count    Frequency (%)
ASCII   349920   100.0%

Most frequent ASCII characters

Value   Count    Frequency (%)
0       144218   41.2%
.       116640   33.3%
2       52605    15.0%
1       36457    10.4%

breast
Boolean

Distinct count   2
Unique (%)       < 0.1%
Missing          0
Missing (%)      0.0%
Memory size      911.4 KiB

Value   Count   Frequency (%)
1       61286   52.5%
0       55354   47.5%

breast-quad
Real number (ℝ≥0)

ZEROS

Distinct count   5
Unique (%)       < 0.1%
Missing          0
Missing (%)      0.0%
Infinite         0
Infinite (%)     0.0%
Mean             1.4569530178326475
Minimum          0
Maximum          4
Zeros            43116
Zeros (%)        37.0%
Memory size      911.4 KiB

Quantile statistics

Minimum                     0
5-th percentile             0
Q1                          0
median                      2
Q3                          2
95-th percentile            4
Maximum                     4
Range                       4
Interquartile range (IQR)   2

Descriptive statistics

Standard deviation                1.313991617
Coefficient of variation (CV)     0.9018764507
Kurtosis                          -1.040118538
Mean                              1.456953018
Median Absolute Deviation (MAD)   1
Skewness                          0.3117823196
Sum                               169939
Variance                          1.726573968
Histogram with fixed size bins (bins=10)

Common values

Value   Count   Frequency (%)
0       43116   37.0%
2       39092   33.5%
3       14218   12.2%
1       10585   9.1%
4       9629    8.3%

Minimum values

Value   Count   Frequency (%)
0       43116   37.0%
1       10585   9.1%
2       39092   33.5%
3       14218   12.2%
4       9629    8.3%

Maximum values

Value   Count   Frequency (%)
4       9629    8.3%
3       14218   12.2%
2       39092   33.5%
1       10585   9.1%
0       43116   37.0%

irradiation
Boolean

Distinct count   2
Unique (%)       < 0.1%
Missing          0
Missing (%)      0.0%
Memory size      911.4 KiB

Value   Count   Frequency (%)
0       83941   72.0%
1       32699   28.0%

recurrence
Boolean

Distinct count   2
Unique (%)       < 0.1%
Missing          0
Missing (%)      0.0%
Memory size      911.4 KiB

Value   Count   Frequency (%)
0       76282   65.4%
1       40358   34.6%

target
Real number (ℝ)

Distinct count   92234
Unique (%)       79.1%
Missing          0
Missing (%)      0.0%
Infinite         0
Infinite (%)     0.0%
Mean             24.69291394898318
Minimum          -8.526242256164549
Maximum          62.008262634277344
Zeros            0
Zeros (%)        0.0%
Memory size      911.4 KiB

Quantile statistics

Minimum                     -8.526242256
5-th percentile             7.176139021
Q1                          18.91156435
median                      25.13337612
Q3                          30
95-th percentile            43.10170422
Maximum                     62.00826263
Range                       70.53450489
Interquartile range (IQR)   11.08843565

Descriptive statistics

Standard deviation                10.3484387
Coefficient of variation (CV)     0.4190853588
Kurtosis                          -0.1868727427
Mean                              24.69291395
Median Absolute Deviation (MAD)   5.033381462
Skewness                          0.06398273421
Sum                               2880181.483
Variance                          107.0901836
Histogram with fixed size bins (bins=10)

Common values

Value                  Count   Frequency (%)
30                     24046   20.6%
45.78100586            3       < 0.1%
38.45279694            2       < 0.1%
24.75168228            2       < 0.1%
18.67092133            2       < 0.1%
25.76951408            2       < 0.1%
19.60035133            2       < 0.1%
25.16242981            2       < 0.1%
47.65153503            2       < 0.1%
24.16799164            2       < 0.1%
25.03426933            2       < 0.1%
26.19222832            2       < 0.1%
21.57945824            2       < 0.1%
20.56734657            2       < 0.1%
20.48497009            2       < 0.1%
39.25237274            2       < 0.1%
19.91839027            2       < 0.1%
21.51720428            2       < 0.1%
18.80518532            2       < 0.1%
24.66407967            2       < 0.1%
24.7225399             2       < 0.1%
24.93440819            2       < 0.1%
37.8710022             2       < 0.1%
19.23152161            2       < 0.1%
25.01729012            2       < 0.1%
Other values (92209)   92545   79.3%

Minimum values

Value          Count   Frequency (%)
-8.526242256   1       < 0.1%
-7.27835083    1       < 0.1%
-7.227000237   1       < 0.1%
-6.797728062   1       < 0.1%
-6.521306992   1       < 0.1%
-6.419077873   1       < 0.1%
-6.360074043   1       < 0.1%
-6.100745201   1       < 0.1%
-6.08950901    1       < 0.1%
-5.974229813   1       < 0.1%

Maximum values

Value         Count   Frequency (%)
62.00826263   1       < 0.1%
58.25691223   1       < 0.1%
58.23876572   1       < 0.1%
58.08186722   1       < 0.1%
58.03478241   1       < 0.1%
57.6429863    1       < 0.1%
57.51083755   1       < 0.1%
57.43823242   1       < 0.1%
57.21965408   1       < 0.1%
57.20811844   1       < 0.1%

Interactions

[16 interaction plots]

Correlations


Pearson's r

Pearson's correlation coefficient (r) is a measure of linear correlation between two variables. Its value lies between -1 and +1, with -1 indicating total negative linear correlation, 0 indicating no linear correlation, and +1 indicating total positive linear correlation. Furthermore, r is invariant under separate changes in location and scale of the two variables, implying that for a linear function the angle to the x-axis does not affect r.

To calculate r for two variables X and Y, one divides the covariance of X and Y by the product of their standard deviations.
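
The calculation described above can be sketched in plain Python as follows. This is an illustration of the definition only, not the code the report generator runs; the function name pearson_r is hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Covariance of xs and ys divided by the product of their standard deviations."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # The 1/n factors of the covariance and the standard deviations cancel,
    # so plain sums of deviations are enough.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    dev_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    dev_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if dev_x == 0 or dev_y == 0:
        raise ValueError("r is undefined for a constant variable")
    return cov / (dev_x * dev_y)

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.5, 5.5, 9.0]
r = pearson_r(xs, ys)
# Shifting and rescaling one variable (with a positive scale) leaves r unchanged.
r_shifted = pearson_r([10 * x + 3 for x in xs], ys)
print(r, r_shifted)
```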

Spearman's ρ

Spearman's rank correlation coefficient (ρ) is a measure of monotonic correlation between two variables, and is therefore better at catching nonlinear monotonic correlations than Pearson's r. Its value lies between -1 and +1, with -1 indicating total negative monotonic correlation, 0 indicating no monotonic correlation, and +1 indicating total positive monotonic correlation.

To calculate ρ for two variables X and Y, one divides the covariance of the rank variables of X and Y by the product of their standard deviations.
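
As a sketch of that recipe: replace each value by its rank, then apply the covariance-over-standard-deviations formula to the ranks. Giving tied values the average of their ranks is an assumption of this example (a common convention), and the function names are hypothetical.

```python
import math

def average_ranks(values):
    """1-based ranks; tied values share the average of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(xs, ys):
    """Pearson correlation computed on the rank variables of xs and ys."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)

# A nonlinear but strictly increasing relationship.
rho = spearman_rho([1, 2, 3, 4], [1, 4, 9, 16])
print(rho)
```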

Kendall's τ

Similarly to Spearman's rank correlation coefficient, the Kendall rank correlation coefficient (τ) measures ordinal association between two variables. Its value lies between -1 and +1, with -1 indicating total negative correlation, 0 indicating no correlation, and +1 indicating total positive correlation.

To calculate τ for two variables X and Y, one determines the number of concordant and discordant pairs of observations. τ is the number of concordant pairs minus the number of discordant pairs, divided by the total number of pairs.
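
A direct, quadratic-time sketch of that definition is shown below. It counts tied pairs as neither concordant nor discordant, which is an assumption of this example; tie-adjusted variants such as τ-b exist and may be what a given library computes. The function name kendall_tau_a is hypothetical.

```python
from itertools import combinations

def kendall_tau_a(xs, ys):
    """(concordant pairs - discordant pairs) / total number of pairs."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1   # both variables order the pair the same way
        elif sign < 0:
            discordant += 1   # the variables order the pair in opposite ways
        # sign == 0: a tie in x or y, counted as neither
    total_pairs = n * (n - 1) // 2
    return (concordant - discordant) / total_pairs

# Two concordant pairs and one discordant pair out of three.
tau = kendall_tau_a([1, 2, 3], [1, 3, 2])
print(tau)
```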

Phik (φk)

Phik (φk) is a new and practical correlation coefficient that works consistently between categorical, ordinal and interval variables, captures non-linear dependency and reverts to the Pearson correlation coefficient in case of a bivariate normal input distribution. Extensive documentation is available for Phik.

Cramér's V (φc)

Cramér's V is an association measure for nominal random variables. The coefficient ranges from 0 to 1, with 0 indicating independence and 1 indicating perfect association. The empirical estimators used for Cramér's V have been proved to be biased, even for large samples. We use a bias-corrected measure proposed by Bergsma in 2013.
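
A minimal sketch of a bias-corrected Cramér's V, following the correction as it is commonly stated (shrink φ² by (k-1)(r-1)/(n-1) and shrink the row and column counts correspondingly). It takes a contingency table of counts as a list of rows. The function name is hypothetical, and this is not necessarily the exact code path used to produce this report.

```python
import math

def cramers_v_corrected(table):
    """Bias-corrected Cramér's V for an r x k contingency table of counts."""
    r, k = len(table), len(table[0])
    n = sum(sum(row) for row in table)
    if n < 2:
        raise ValueError("need at least two observations")
    row_totals = [sum(row) for row in table]
    col_totals = [sum(table[i][j] for i in range(r)) for j in range(k)]

    # Pearson chi-squared statistic against the independence expectation.
    chi2 = 0.0
    for i in range(r):
        for j in range(k):
            expected = row_totals[i] * col_totals[j] / n
            if expected > 0:
                chi2 += (table[i][j] - expected) ** 2 / expected

    phi2 = chi2 / n
    phi2_corr = max(0.0, phi2 - (k - 1) * (r - 1) / (n - 1))
    r_corr = r - (r - 1) ** 2 / (n - 1)
    k_corr = k - (k - 1) ** 2 / (n - 1)
    denom = min(k_corr - 1, r_corr - 1)
    if denom <= 0:
        return 0.0  # degenerate table (a single row or column)
    return math.sqrt(phi2_corr / denom)

independent = cramers_v_corrected([[10, 10], [10, 10]])
perfect = cramers_v_corrected([[50, 0], [0, 50]])
print(independent, perfect)
```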

Missing values

[2 missing-value plots]

Sample

First rows

     age        menopause  inv-nodes  node-caps  deg-malig  breast  breast-quad  irradiation  recurrence  target
0    43.296398  2          0.0        0          2.0        1       4            0            0           30.000000
1    53.447540  1          0.0        0          0.0        1       2            0            0           20.076019
2    50.262108  0          0.0        0          0.0        0       1            0            0           30.000000
3    44.398716  0          0.0        0          0.0        1       0            0            1           32.415188
4    49.598515  0          1.0        0          2.0        1       3            0            0           33.421253
5    67.982735  1          0.0        0          2.0        1       2            0            0           24.455177
6    62.187977  1          8.0        0          2.0        0       2            0            1           20.927906
7    64.673241  1          4.0        1          1.0        1       1            0            1           30.000000
8    34.301979  0          0.0        0          2.0        0       2            0            1           11.513506
9    42.491680  0          10.0       0          1.0        1       3            0            1           30.000000

Last rows

        age        menopause  inv-nodes  node-caps  deg-malig  breast  breast-quad  irradiation  recurrence  target
116630  44.609718  1          6.0        0          0.0        1       0            0            1           30.000000
116631  39.083138  0          0.0        0          2.0        0       2            0            0           4.987052
116632  70.723602  1          0.0        0          1.0        1       1            0            0           30.000000
116633  38.360680  2          1.0        1          1.0        1       0            1            0           46.038864
116634  62.524914  1          3.0        0          2.0        0       4            1            0           20.436518
116635  47.384506  0          0.0        0          1.0        1       0            0            0           24.230558
116636  51.599133  1          7.0        1          2.0        1       0            0            1           17.847170
116637  69.042389  1          0.0        0          2.0        1       0            0            0           15.111092
116638  48.264580  0          0.0        0          1.0        1       1            0            0           21.324703
116639  64.028008  1          0.0        0          2.0        1       1            0            0           20.348454