"fit()" vs "transform()" vs "fit_transform()" in sklearn pipeline
Some transformations rely on parameters that are computed from the training set, e.g. the mean and standard deviation used by StandardScaler.
However, the same transformation also has to be applied to the test set (e.g. in cross-validation) or to newly obtained examples before prediction, using the same parameters (mean and std) that were computed on the training set.
Hence,
- Sklearn's fit() only calculates the parameters (e.g. the mean and std in the case of StandardScaler) and saves them as the object's internal state. It does not change the data.
- The transform() method can be called afterwards to apply the transformation to a particular set of examples.
- fit_transform() combines these two steps: it fits the parameters on the training set and returns the transformed training set. By default it is equivalent to calling fit() and then transform() on the same data, although some estimators override it with a more efficient implementation.
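
The three methods can be sketched with StandardScaler as follows. This is a minimal illustration, not a recipe: the arrays X_train and X_test are made-up toy data, and the variable names are only for this example.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data, assumed purely for illustration
X_train = np.array([[1.0], [2.0], [3.0]])
X_test = np.array([[4.0]])

# fit(): learn the mean and std from the training set only
scaler = StandardScaler()
scaler.fit(X_train)
print(scaler.mean_, scaler.scale_)  # parameters stored on the object

# transform(): apply the already-learned parameters to any data
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)  # uses the training mean and std

# fit_transform(): fit and transform the training set in one call
X_train_scaled_2 = StandardScaler().fit_transform(X_train)
```

Note that fit_transform() is never called on X_test here: doing so would recompute the mean and std from the test data instead of reusing the training parameters.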