Python Data Wrangling with Numpy and Pandas
I came across an interesting problem today. A data provider changed the decimal format of how they provide their financial market prices. This is problematic because my previous database of prices is different. This will cause any analysis we do with a combination of the new data and the old data to be way out of whack.
Luckily, there is a quick an easy way to reformat your data back into your desired result using the numpy.where() method. Numpy is a popular python library that is part of the python science stack and is especially utilized in work with arrays. We will also use pandas.
Let’s do a quick exercise
For example, if you have a simple and example data set of:
date | price_close |
2/1/2022 | 1.01 |
2/2/2022 | 1.02 |
2/3/2022 | 1.03 |
2/4/2022 | 1.04 |
2/5/2022 | 1.05 |
But the new data coming in looks like this:
date | price_close |
2/6/2022 | 0.00106 |
2/7/2022 | 0.00107 |
2/8/2022 | 0.00108 |
2/9/2022 | 0.00109 |
2/10/2022 | 0.0011 |
When you combine the data set, it will look like this:
date | price_close |
2/1/2022 | 1.01 |
2/2/2022 | 1.02 |
2/3/2022 | 1.03 |
2/4/2022 | 1.04 |
2/5/2022 | 1.05 |
2/6/2022 | 0.00106 |
2/7/2022 | 0.00107 |
2/8/2022 | 0.00108 |
2/9/2022 | 0.00109 |
2/10/2022 | 0.0011 |
Ok, so that is the data we have. Let’s turn that small data set into a Pandas dataframe to work with.
import pandas as pd import numpy as np datafolder = 'My_Folder_Location' pricedata = pd.read_csv('{}/Test_Data_Set.csv'.format(datafolder), parse_dates=['date']) df = pd.DataFrame(data=pricedata)
print(df)
Output:
df['price_close'] = np.where(df['date'] >= '2/6/2022', df['price_close']*1000, df['price_close'])
print(df)
And success! We have now correctly formatted the decimals for our prices.
More columns:
If you needed to change multiple columns based on that same condition, you can just add another line (or lines) of code with your desired parameters like the example below.
df['price_close'] = np.where(df['date'] >= '2/6/2022', df['price_close']*1000, df['price_close']) df['second_column'] = np.where(df['date'] >= '2/6/2022', df['second_column']*1000, df['second_column'])
Article by Zachary Storella – See more programming posts on our Python Page