lsq546397641的个人博客分享 http://blog.sciencenet.cn/u/lsq546397641

博文

pandas中阶实战

已有 899 次阅读 2023-7-8 15:09 |系统分类:科研笔记

模拟数据:

df = pd.DataFrame({'region':['','','','','','',''],

                'province':['山东','陕西','辽宁','浙江','浙江','山东','陕西'],

                'gdp-p': [8.74,3.28,2.9,7.77,7.77,8.74,3.28],

                'city': ['济南','西安','沈阳','杭州','宁波','青岛','咸阳'],

                'gdp-c': [1.2,1.15,0.77,1.88,1.57,1.49,0.28]})


1.转换为列表

目的:生成北方5万亿级省份列表

p_north_5_club = df[df['5_club-p-north'] == 'Y']['province'].tolist()

image.png

2.转换为字典

2-1

目的:生成城市和其对应gdp的字典

city_gdp = {}

city_gdp.update(dict(zip(df['city'], df['gdp-c'])))

image.png

2-2

目的:生成城市和其对应gdp的字典

city_gdp = df.set_index('city')['gdp-c'].to_dict()

image.png

3.删除某些列

目的:删除region5_club-p-north

df.drop(columns=['region','5_club-p-north'], inplace=True)

说明 :Inplace:直接在原来df上修改

image.png

4.列名重命名

目的:region重命名区域,city重命名市

columnName_new = {'region':'区域',

              'city':''}

df.rename(columns=columnName_new,inplace=True)

image.png

5.处理某一列

目的:gdp-p小数点后保留一位

def decimal_point(value):

    return round(value,1)

df['gdp-p'] = df['gdp-p'].apply(decimal_point)

image.png

6.增加新列

6-1 df.loc[:,'新列名'] = df.apply(lambda x:func(x),axis=1)

目的:新增一列trillionClub-c<城市万亿级俱乐部>

def trillionClub(df):

    result = ''

    if df['gdp-c'] >= 1:

        result = 'Yes'

    else:

        result = 'No'

    return result

df.loc[:,'trillionClub-c'] = df.apply(lambda x:trillionClub(x),axis=1)

image.png

6-2

目的:新增一列5_club-p-north<北方省份5万亿俱乐部>

def merge_dfColumns(x,y):

    result = ''

    if x == '' and y >=5:

        result = 'Y'

    else:

        result = 'N'

    return result

df['5_club-p-north']=list(map(lambda x,y: merge_dfColumns(x,y), df['region'], df['gdp-p']))

image.png

7.排序

.sort_values()

目的:按照region分组,各省市按照gdp降序排序

df = df.sort_values(by=['region','gdp-p','province','gdp-c'],ascending= [True,False,True,False])

说明 :ascending: False降序;True升序

image.png

8.组内筛选并排序

目的:万亿级俱乐部城市,在省内排名<1开始>;非万亿级俱乐部城市排名为0。列名为1_club-c_order

def group_order_city(df,city_groupOrder):

    # groupOrder = ''

    if df['city'] in city_groupOrder.keys():

        groupOrder = city_groupOrder[df['city']]

    else:

        groupOrder = '0'

    return groupOrder

city_groupOrder = {}

df_GroupOrder = df.groupby(['province','1_club-c'])

for name,group in df_GroupOrder:

    if name[1] == 'Yes':

        group.index = group.reset_index(drop=True).index+1  # index1开始

        group.reset_index(inplace=True)  # index为一列

        city_groupOrder.update(dict(zip(group['city'], group['index'])))

    else:

        pass

df.loc[:,'1_club-c_order'] = df.apply(lambda x:group_order_city(x,city_groupOrder),axis=1)

image.png

9.将结果写入一个xlsx的不同sheet

writer = pd.ExcelWriter(‘test.xlsx’)

df1.to_excel(writer, sheet_name='df1',index=False)

df2.to_excel(writer, sheet_name='df2',index=False)

writer.close()

说明 : index=False : 索引号不写入




https://m.sciencenet.cn/blog-994715-1394576.html

上一篇:taxonkit联合csvtk,生成Taxonomy(NCBI)
下一篇:pandas数据合并pd.merge

0

该博文允许注册用户评论 请点击登录 评论 (22 个评论)

数据加载中...
扫一扫,分享此博文

Archiver|手机版|科学网 ( 京ICP备07017567号-12 )

GMT+8, 2024-2-25 04:08

Powered by ScienceNet.cn

Copyright © 2007- 中国科学报社

返回顶部