Problem description

I have some census data, which are 20 excel files for 20 year respectively (1980 - 1999). Each of the files contains county names and corresponding census data, e.g., GRDP, population, education.

The aim is to get GDP data from each year, and save them in one excel file.

I used Python 3.7 pandas packge to solve this problem.


Code

Function 1: get GPD column from a excel file

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
# get GRDP and name from a certain year

def getGRDP(year):
# read data from excel
file_name = "Jiangsu_" + str(year) + ".xls"
data = pd.read_excel(file_name)

# form column names
name = "_" + str(year) + "$.name"
gdp = "_" + str(year) + "$.GRDP"

# get name and GRDP column
data = data.loc[:, [name, gdp]][:60]

return data

Function 2: join a column to an existing table

1
2
3
4
5
6
7
8
9
10
11
12
# join a table to the table of 1999

def join2_table(year, result):

targetGDP = getGRDP(year)

# get index name
index_name = "_"+ str(year) + "$.name"

result = result.join(targetGDP.set_index(index_name))

return result

Global variable

1
2
3
4
5
# initialize
result = getGRDP(1999)

# set key for the table
result.set_index("_1999$.name")

Add all the GPD column to the global variable

1
2
3
4
for i in range(1998, 1980, -1):
data = getGRDP(i)
index_name = '_' + str(i) + "$.name"
result = result.join(data.set_index(index_name))