```
\documentclass{article}
\usepackage{arxiv}
\usepackage[utf8]{inputenc} % allow utf-8 input
\usepackage[T1]{fontenc} % use 8-bit T1 fonts
\usepackage{hyperref} % hyperlinks
\usepackage{url} % simple URL typesetting
\usepackage{booktabs} % professional-quality tables
\usepackage{amsfonts} % blackboard math symbols
\usepackage{nicefrac} % compact symbols for 1/2, etc.
\usepackage{microtype} % microtypography
\usepackage{lipsum}
\usepackage{graphicx}
\title{Determining The Cost Of A Rail Ticket}
\author{
Archana R Warrier \\
Department of Mathematics and Computing\\
Birla Institute of Technology, Mesra\\
Ranchi, 835217 \\
\texttt{archanarw@gmail.com} \\
\And
Arjita Basu \\
Department of Mathematics and Computing\\
Birla Institute of Technology, Mesra\\
Ranchi, 835217 \\
\texttt{arjita.basu@gmail.com} \\
\AND
Ankit Tewari \\
Artificial Intelligence Engineer \\
Knowledge Engineering and Machine Learning Group \\
\texttt{ankit.tewari@estudiant.upc.edu} \\
}
\begin{document}
\maketitle
\begin{abstract}
We use the publicly available dataset on Spanish high-speed rail ticket pricing, hosted at \url{https://www.kaggle.com/thegurus/spanish-high-speed-rail-system-ticket-pricing}.
The project notebook is available on GitHub at
\url{https://github.com/archanarw/Train-pricing/blob/master/TrainTicketPricing.ipynb}
and on Kaggle at
\url{https://www.kaggle.com/arjita2000/spanish-train-system-data-analysis/edit}.
\end{abstract}
\section{INTRODUCTION}
Spain has an extensive high-speed train network, run by a few major operators, one of which is RENFE. Determining the price of a Spanish high-speed rail ticket in advance is a challenge for travellers, since it may depend on factors such as train type, class, origin and destination city. This project aims to develop a price prediction model using linear regression and k-nearest neighbours (KNN) to address this challenge.
\section{DATA AND ITS PREPROCESSING}
\label{sec:headings}
The data source for this study is the Spanish High Speed Rail tickets pricing dataset for Renfe. The dataset includes 2,579,771 entries, each with 9 features. Figure~\ref{fig:Capture1} shows the first five rows of the dataframe we work with.
\begin{figure}[h]
\centering
\includegraphics[width=10cm]{Capture.PNG}
\caption{The first five rows of dataframe.}
\label{fig:Capture1}
\end{figure}
The data had 310,681 null values in the price column, 9,664 null values in the train-class column, and 9,664 in the fare column. The null values in the price column were replaced by the mean of that column, after which the rows containing null values in the fare and train-class columns were dropped. This left 2,570,107 rows to be evaluated.
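As a sketch of the imputation step, let $O$ denote the set of rows whose price is observed and $p_i$ the price in row $i$ (this notation is ours and is introduced only for illustration). Each missing price is then replaced by
\begin{equation}
\bar{p} = \frac{1}{|O|} \sum_{i \in O} p_i .
\end{equation}
The number of rows remaining after the drop follows from
\begin{equation}
2{,}579{,}771 - 9{,}664 = 2{,}570{,}107 ,
\end{equation}
which is consistent with the null values of the fare and train-class columns lying in the same 9,664 rows.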
\begin{figure}[h]
\centering
\includegraphics[width=8cm]{heatmap.png}
\caption{Correlation heatmap of the dataset}
\label{fig:c-heatmap}
\end{figure}
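The entries of a correlation heatmap such as Figure~\ref{fig:c-heatmap} are pairwise correlation coefficients. Assuming the Pearson coefficient was used (the text does not say which one; Pearson is the default in common data-analysis libraries such as pandas), the correlation between two numerical columns $x$ and $y$ with $n$ rows is
\begin{equation}
r_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}} ,
\end{equation}
where $\bar{x}$ and $\bar{y}$ are the column means. Values of $r_{xy}$ close to $0$ indicate a weak linear association, while values close to $\pm 1$ indicate a strong one.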
\section{VISUALIZING THE DATA}
\label{sec:visualizing}
We now examine some summary statistics of the data through the bar plots below.
\begin{figure}[h]
\centering
\begin{minipage}[b]{0.45\textwidth}
\includegraphics[width=\textwidth]{1.PNG}
\caption{No. of people boarding from each station.}
\label{fig:1}
\end{minipage}
\hfill
\begin{minipage}[b]{0.45\textwidth}
\includegraphics[width=\textwidth]{3.PNG}
\caption{No. of people getting off at each station.}
\label{fig:3}
\end{minipage}
\end{figure}
\begin{figure}[h]
\centering
\begin{minipage}[b]{0.45\textwidth}
\includegraphics[width=\textwidth]{2.PNG}
\caption{Train type vs. Price}
\label{fig:2}
\end{minipage}
\hfill
\begin{minipage}[b]{0.45\textwidth}
\includegraphics[width=\textwidth]{4.PNG}
\caption{Train Class vs. Price}
\label{fig:4}
\end{minipage}
\end{figure}
\begin{figure}[h]
\centering
\begin{minipage}[b]{0.45\textwidth}
\includegraphics[width=\textwidth]{5.PNG}
\caption{No. of trains of each train-class.}
\label{fig:5}
\end{minipage}
\hfill
\begin{minipage}[b]{0.45\textwidth}
\includegraphics[width=\textwidth]{6.PNG}
\caption{No. of tickets bought from each train category.}
\label{fig:6}
\end{minipage}
\end{figure}
From Figure~\ref{fig:1} we can infer that the largest number of passengers board at Madrid, and Figure~\ref{fig:3} shows that the largest number also get off at Madrid. Figure~\ref{fig:2} indicates that train type AVE-TGV is the costliest, whereas Regional is the cheapest. In Figure~\ref{fig:4} we observe that Cama G. clase is the costliest class, whereas Turista con enlace is the cheapest. Figure~\ref{fig:5} shows that the Turista class is the most common, while Figure~\ref{fig:6} shows that promo-type tickets were sold most often.
\section{REGRESSION ANALYSIS}
\subsection{Linear Regression}
We applied linear regression to the dataset using the scikit-learn library. The features considered for determining the train prices were origin, destination, train type, fare, train class and travel time.
Most of these features are categorical, so we used a label encoder to transform them into numerical format before applying regression.
After this conversion, the linear regression model achieved a score of 0.6267. The score measures how closely the predictions on the testing dataset match the actual prices.
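For reference, a linear regression model on the six encoded features can be written as below. The symbols are our own notation, and we assume that the reported score is the coefficient of determination $R^2$, which is what the \texttt{score} method of scikit-learn's \texttt{LinearRegression} returns; the text above does not state this explicitly.
\begin{equation}
\hat{y}_i = \beta_0 + \sum_{j=1}^{6} \beta_j x_{ij} ,
\end{equation}
\begin{equation}
R^2 = 1 - \frac{\sum_{i} (y_i - \hat{y}_i)^2}{\sum_{i} (y_i - \bar{y})^2} .
\end{equation}
Here $x_{ij}$ is the encoded value of feature $j$ for ticket $i$, $y_i$ is the actual price, $\hat{y}_i$ is the predicted price, and $\bar{y}$ is the mean of the actual prices in the evaluated data. Under this reading, a score of 0.6267 would mean that the model accounts for roughly 63\% of the variance in price on that data.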
\subsection{KNN}
KNN can be used for both regression and classification problems. The algorithm assigns a value to a new point based on how closely it resembles the points in the training set. We used the algorithm with $K = 5$ and computed the mean squared error of the predicted train ticket prices.
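The KNN regression rule can be sketched as follows, assuming uniform weighting of the neighbours and Euclidean distance (the defaults of scikit-learn's \texttt{KNeighborsRegressor}; the text above does not state which settings were used). For a new point $x$, the predicted price is the average over its $K$ nearest training points,
\begin{equation}
\hat{y}(x) = \frac{1}{K} \sum_{i \in \mathcal{N}_K(x)} y_i , \qquad K = 5 ,
\end{equation}
where $\mathcal{N}_K(x)$ denotes the set of the $K$ training points closest to $x$. The mean squared error over $m$ test tickets is then
\begin{equation}
\mathrm{MSE} = \frac{1}{m} \sum_{i=1}^{m} \bigl( y_i - \hat{y}(x_i) \bigr)^2 .
\end{equation}
The symbols $\mathcal{N}_K$ and $m$ are introduced here only for illustration.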
\section{CONCLUSIONS}
From the diagrams, our first conclusion is that the largest number of passengers boarded at Madrid, and that Madrid was also the most frequent destination.
We also see that the month and hour of the start date, and similarly of the end date, had minimal correlation with the price. The journey duration had some effect on the price of the tickets.
\section{ACKNOWLEDGEMENT}
We would like to express our special thanks to Ankit Tewari, who guided us on the project. Doing this study also led us to do a lot of research and to learn many new things.
We would also like to thank our parents and friends, who helped us a great deal in finalizing this project within the limited time frame.
\bibliographystyle{unsrt}
\begin{thebibliography}{9}
\bibitem{renfe-dataset}
\url{https://www.kaggle.com/thegurus/spanish-high-speed-rail-system-ticket-pricing}
\bibitem{knn-regression}
\url{https://www.analyticsvidhya.com/blog/2018/08/k-nearest-neighbor-introduction-regression-python/}
\end{thebibliography}
\end{document}
```